HTML discovery: SGML entities and charsets

Claus Färber GMANE at faerber.muc.de
Sun May 20 19:28:58 UTC 2007


Peter Watkins wrote:
> 7.3.3 in draft 11 says
> 
> The "openid2.provider" and "openid2.local_id" URLs MUST NOT include entities other than "&amp;", "&lt;", "&gt;", and "&quot;". Other characters that would not be valid in the HTML document or that cannot be represented in the document's character encoding MUST be escaped using the percent-encoding (%xx) mechanism described in [RFC3986] (Berners-Lee, T., "Uniform Resource Identifiers (URI): Generic Syntax").

Please note that the draft is completely broken here:

It's unclear: The first sentence talks about "entities", which can only 
refer to "character entity references" (HTML 4.01, 5.3.2). The second 
sentence mandates RFC 3986 encoding, which is plain wrong because it 
changes the URI. It does not talk about "numeric character references" 
at all (which are _not_ entities, see HTML 4.01, 5.3.1), which are the 
only correct way to encode a URI that contains a "'" ("&#39;" or "&#x27;").

It's incompatible: An HTML editor, tool or filter may assume that 
replacing any character with a character reference is allowed, so it may 
change "http://example.org?login=user@example.net" to 
"http://example.org?login=user&#64;example.net" without changing the 
meaning. The spec breaks this assumption.

It's dangerous: It's there to allow RP implementations to use a quick and 
dirty regexp-based parser instead of a true HTML parser, which (a) may 
break with completely valid HTML documents (bad user experience) and (b) 
may circumvent security measures taken by the site owners.
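
What a true HTML parser does with such attribute values can be sketched with 
Python's standard html.parser module. This is only an illustration, not 
anything taken from the draft: the names OpenIDLinkParser, discover and DOC 
are made up for the example, and a real RP would need further checks (for 
example, where in the document the LINK element appears).

```python
from html.parser import HTMLParser


class OpenIDLinkParser(HTMLParser):
    """Collect href values of <link> elements, keyed by rel token."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # rel is a space-separated list of tokens; compare case-insensitively.
        for token in (attr.get("rel") or "").lower().split():
            # Keep the first href seen for each rel token.
            self.links.setdefault(token, href)


def discover(html_text):
    """Return a dict mapping rel tokens to decoded href values."""
    parser = OpenIDLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical discovery document: the parser hands back attribute values
# with character references already decoded.
DOC = """<html><head>
<link rel="openid2.provider"
      href="http://example.org/?login=user&#64;example.net&amp;x=1">
<link rel='openid2.local_id' href='http://example.org/o&#39;brien'>
</head><body></body></html>"""

links = discover(DOC)
print(links["openid2.provider"])  # http://example.org/?login=user@example.net&x=1
print(links["openid2.local_id"])  # http://example.org/o'brien
```

A regexp that only looks for literal characters in the href would see 
"&#64;" and "&#39;" as part of the URL, which is exactly the kind of 
breakage described above.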

> 1) Why are the characters &, <, >, and " allowed to be represented with those
> SGML entities? Why not require them to be encoded per RFC 3986 as %26, %3C,
> %3E, and %22? 

The point of RFC 3986 encoding is that URL special chars lose their 
special meaning _within_ _the_ _URL_:

http://example.org/?foo=1&bar=2 contains two parameters: "foo" with the 
value "1" and "bar" with the value "2".
http://example.org/?foo=1%26bar=2 contains a _single_ parameter, "foo", 
with the value "1&bar=2".
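
The two URLs can be compared with Python's urllib.parse, as a quick sketch 
(the helper name query_params is made up for this example):

```python
from urllib.parse import parse_qs, urlsplit


def query_params(url):
    """Split the query component into parameters and percent-decode them."""
    return parse_qs(urlsplit(url).query)


# A literal "&" separates parameters ...
print(query_params("http://example.org/?foo=1&bar=2"))
# {'foo': ['1'], 'bar': ['2']}

# ... while "%26" is just data inside the value of "foo".
print(query_params("http://example.org/?foo=1%26bar=2"))
# {'foo': ['1&bar=2']}
```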

The point of HTML encoding is that HTML special chars lose their special 
meaning _within_ _HTML_:

<a href="http://example.org/?x=1&copy=2"> is a link to the IRI
http://example.org/?x=1©=2, which is equivalent to the ASCII URI 
http://example.org/?x=1%C2%A9=2.

<a href="http://example.org/?x=1&amp;copy=2"> is a link to the URI
http://example.org/?x=1&copy=2
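
The two attribute values can be decoded with Python's html.unescape as a 
rough check. This is a sketch under an assumption: html.unescape applies 
lenient rules that accept "&copy" without the trailing semicolon, which 
matches the reading given here, but other parsers may treat that case 
differently. The helper name href_target is made up for the example.

```python
import html


def href_target(attr_value):
    """Decode the character references in a raw HTML attribute value."""
    return html.unescape(attr_value)


# "&copy" is taken as the copyright sign, so the link target changes.
print(href_target("http://example.org/?x=1&copy=2"))
# http://example.org/?x=1©=2

# "&amp;" decodes to a plain "&", leaving a parameter named "copy".
print(href_target("http://example.org/?x=1&amp;copy=2"))
# http://example.org/?x=1&copy=2
```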

However, "<" and ">" are not legal within URIs and IRIs anyway. Other 
characters with named entities are outside the ASCII range and thus 
illegal in URIs but not IRIs.

> 2) Also, should 7.3.3 specify that, as with the key/value data pairs, these
> values be encoded in UTF-8? Requiring UTF-8 would free RP code from having
> to understand different HTML character sets, and would allow users to encode
> their HTML delivery pages in the charset of their choosing.

No, the whole HTML document must use the same character set.

However, unless you're using IRIs, you can usually get away with 
treating the document as ASCII; you'll have some characters with the 8th 
bit set but you can simply ignore them if you just want to extract URIs.

Problematic charsets include ISO-2022 (common), Shift-JIS (very common; 
the only problem wrt URIs is "~", which can't be encoded at all), UTF-16 
(rare), UTF-32 (very rare), EBCDIC-based charsets (very rare) and 
national ISO-646 variants.
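
The "treat it as ASCII" shortcut, and where it stops working, can be 
sketched in Python. The helper names ascii_view and is_ascii_transparent 
and the sample strings are made up for this example; the codec names are 
the ones Python ships (cp037 is one of its EBCDIC codecs).

```python
def ascii_view(raw):
    """Decode bytes as ASCII, replacing every byte with the 8th bit set."""
    return raw.decode("ascii", errors="replace")


def is_ascii_transparent(sample, codec):
    """True if the codec encodes this ASCII-only sample as its ASCII bytes."""
    return sample.encode(codec) == sample.encode("ascii")


URI = "http://example.org/"
PAGE = 'F\u00e4rber <link rel="openid2.provider" href="' + URI + '">'

# ISO-8859-1 is ASCII-compatible: the URI survives the ASCII view intact,
# only the non-ASCII letter is replaced.
print(URI in ascii_view(PAGE.encode("iso-8859-1")))  # True

# UTF-16 interleaves NUL bytes, so the URI is no longer found as such.
print(URI in ascii_view(PAGE.encode("utf-16-le")))  # False

for codec in ("utf-8", "iso-8859-1", "utf-16-le", "cp037"):
    print(codec, is_ascii_transparent(URI, codec))
```

For the charsets in the list above, the document has to be decoded with 
its declared charset before any URI can be extracted.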

Claus



