HTML discovery: SGML entities and charsets
Claus Färber
GMANE at faerber.muc.de
Sun May 20 19:28:58 UTC 2007
Peter Watkins wrote:
> 7.3.3 in draft 11 says
>
> The "openid2.provider" and "openid2.local_id" URLs MUST NOT include entities other than "&amp;", "&lt;", "&gt;", and "&quot;". Other characters that would not be valid in the HTML document or that cannot be represented in the document's character encoding MUST be escaped using the percent-encoding (%xx) mechanism described in [RFC3986] (Berners-Lee, T., "Uniform Resource Identifiers (URI): Generic Syntax").
Please note that the draft is completely broken here:
It's unclear: The first sentence talks about "entities", which can only
refer to "character entity references" (HTML 4.01, 5.3.2). The second
sentence mandates RFC 3986 encoding, which is plain wrong because it
changes the URI. It does not talk about "numeric character references"
at all (which are _not_ entities, see HTML 4.01, 5.3.1), which is the
only correct way to encode a URI that contains a "'" ("&#39;"/"&#x27;").
It's incompatible: An HTML editor, tool or filter may assume that
changing any characters to entities is allowed, so it may change
"http://example.org?login=user@example.net" to
"http://example.org?login=user&#64;example.net" without changing the
meaning. The spec breaks this assumption.
It's dangerous: It's there to allow RP implementations to use a quick and
dirty regexp-based parser instead of a true HTML parser, which (a) may
break with completely valid HTML documents (bad user experience) and (b)
may circumvent security measures taken by the site owners.
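
To illustrate the last two points, here is a minimal sketch in Python. The page content, the NAIVE pattern and the ProviderFinder class are hypothetical and only serve as an example; they are not taken from the draft or from any RP implementation. The page is valid HTML, but it uses single quotes, a different attribute order and a numeric character reference, so a fixed regular expression misses it while a real HTML parser still finds the URL and resolves the reference.

```python
import re
from html.parser import HTMLParser

# Hypothetical page: attributes in "unusual" order, single quotes, and a
# numeric character reference for "@". All of this is valid HTML.
PAGE = """<html><head>
<link href='http://example.org?login=user&#64;example.net' rel='openid2.provider'>
</head><body></body></html>"""

# A naive pattern that expects one exact spelling of the element.
NAIVE = re.compile(r'<link rel="openid2\.provider" href="([^"]*)"')


class ProviderFinder(HTMLParser):
    """Collects href values of <link> elements whose rel lists openid2.provider."""

    def __init__(self):
        super().__init__()
        self.providers = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").split()
        if "openid2.provider" in rels and attributes.get("href"):
            # HTMLParser has already resolved character references here.
            self.providers.append(attributes["href"])


def find_providers(page):
    finder = ProviderFinder()
    finder.feed(page)
    finder.close()
    return finder.providers


naive_match = NAIVE.search(PAGE)   # the fixed pattern does not match this page
providers = find_providers(PAGE)   # the parser finds the decoded URL
```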
> 1) Why are the characters &, <, >, and " allowed to be represented with those
> SGML entities? Why not require them to be encoded per RFC 3986 as %26, %3C,
> %3E, and %22?
The point of RFC 3986 encoding is that URL special chars lose their
special meaning _within_ _the_ _URL_:
http://example.org/?foo=1&bar=2 contains two parameters: "foo" with the
value "1" and "bar" with the value "2".
http://example.org/?foo=1%26bar=2 contains a _single_ parameter, "foo",
with the value "1&bar=2".
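
The difference can be checked with a short Python sketch that uses the standard library's query-string parser on the two example URLs above. The variable names are only illustrative.

```python
from urllib.parse import urlsplit, parse_qs

# "&" left as-is: it separates two parameters.
plain = parse_qs(urlsplit("http://example.org/?foo=1&bar=2").query)

# "&" percent-encoded as %26: it becomes part of the value of "foo".
escaped = parse_qs(urlsplit("http://example.org/?foo=1%26bar=2").query)
```

Here plain holds the two parameters "foo" and "bar", while escaped holds only "foo" with the value "1&bar=2".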
The point of HTML encoding is that HTML special chars lose their special
meaning _within_ _HTML_:
<a href="http://example.org/?x=1&copy=2"> is a link to the IRI
http://example.org/?x=1©=2, which is equivalent to the ASCII URI
http://example.org/?x=1%C2%A9=2.
<a href="http://example.org/?x=1&amp;copy=2"> is a link to the URI
http://example.org/?x=1&copy=2
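
A small Python sketch of the two cases, assuming the attribute values are written with a bare "&" and with "&amp;" respectively. It applies html.unescape to the raw attribute value, which matches the behaviour described here; a full HTML5 parser may treat an unterminated reference inside an attribute differently. The safe set passed to quote is an assumption chosen for this example so that only the non-ASCII character gets percent-encoded.

```python
import html
from urllib.parse import quote

# Bare "&copy" is read as the character reference for the copyright sign.
bare = html.unescape("http://example.org/?x=1&copy=2")

# "&amp;" yields a literal "&", so "copy" stays an ordinary parameter name.
escaped = html.unescape("http://example.org/?x=1&amp;copy=2")

# Mapping the IRI to an ASCII URI: the non-ASCII character is encoded as
# UTF-8 and percent-escaped; the URI delimiters are left untouched.
ascii_uri = quote(bare, safe=":/?=&")
```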
However, "<" and ">" are not legal within URIs and IRIs anyway. Other
characters with named entities are outside the ASCII range and thus
illegal in URIs but not IRIs.
> 2) Also, should 7.3.3 specify that, as with the key/value data pairs, these
> values be encoded in UTF-8? Requiring UTF-8 would free RP code from having
> to understand different HTML character sets, and would allow users to encode
> their HTML delivery pages in the charset of their choosing.
No, the whole HTML document must use the same character set.
However, unless you're using IRIs, you can usually get away with
treating the document as ASCII; you'll have some characters with the 8th
bit set but you can simply ignore them if you just want to extract URIs.
Problematic charsets include ISO-2022 (common), Shift-JIS (very common;
only "~" is a problem with respect to URIs, as it cannot be encoded at
all), UTF-16 (rare), UTF-32 (very rare), EBCDIC-based charsets (very rare)
and national ISO-646 variants.
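
As a rough sketch of the "treat it as ASCII" approach, the following Python snippet drops every byte with the 8th bit set and then looks for URIs. The URL pattern, the ascii_urls helper and the sample page are hypothetical and deliberately simplistic. The approach works for ASCII-compatible charsets such as ISO-8859-1 and UTF-8, and fails for UTF-16, where every ASCII character is accompanied by a zero byte.

```python
import re

# Deliberately simple pattern for this example, not a full URI grammar.
URL = re.compile(r'https?://[^\s"\'<>]+')


def ascii_urls(raw: bytes):
    """Ignore all bytes >= 0x80 and return the URIs found in the rest."""
    text = raw.decode("ascii", errors="ignore")
    return URL.findall(text)


PAGE = '<p>Claus Färber</p><a href="http://example.org/?foo=1">x</a>'

latin1 = ascii_urls(PAGE.encode("iso-8859-1"))  # ASCII-compatible: URI found
utf8 = ascii_urls(PAGE.encode("utf-8"))         # ASCII-compatible: URI found
utf16 = ascii_urls(PAGE.encode("utf-16"))       # interleaved NUL bytes: nothing found
```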
Claus