HTML discovery: SGML entities and charsets

Julian Reschke julian.reschke at gmx.de
Mon May 28 14:30:59 UTC 2007


Peter Watkins wrote:
> 7.3.3 in draft 11 says
> 
> The "openid2.provider" and "openid2.local_id" URLs MUST NOT include entities other than "&amp;", "&lt;", "&gt;", and "&quot;". Other characters that would not be valid in the HTML document or that cannot be represented in the document's character encoding MUST be escaped using the percent-encoding (%xx) mechanism described in [RFC3986] (Berners-Lee, T., "Uniform Resource Identifiers (URI): Generic Syntax").
> 
> Questions:
> 
> 1) Why are the characters &, <, >, and " allowed to be represented with those
> SGML entities? Why not require them to be encoded per RFC 3986 as %26, %3C,
> %3E, and %22? 

"<" and ">" are not allowed in URLs anyway. An ampersand can appear in a 
URL, in which case it would have different semantics than %26.
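
To illustrate the difference, here is a minimal Python sketch. The href value and the parameter names ("user", "realm") are made up for the example and do not come from the draft. "&amp;" is only the HTML spelling of a literal "&", which still separates query parameters once the attribute is decoded, whereas "%26" remains an ampersand that is data inside a value.

```python
from html import unescape
from urllib.parse import urlsplit, parse_qs

# Hypothetical href attribute value, exactly as written in the HTML source.
raw_href = "https://op.example.com/auth?user=alice&amp;realm=a%26b"

# Step 1: HTML attribute decoding turns the entity "&amp;" into "&".
# Percent-escapes are not touched at this layer.
href = unescape(raw_href)

# Step 2: URI query parsing splits on the literal "&" and only then
# decodes "%26" into an ampersand that belongs to the value.
query = parse_qs(urlsplit(href).query)

print(href)   # https://op.example.com/auth?user=alice&realm=a%26b
print(query)  # {'user': ['alice'], 'realm': ['a&b']}
```

Writing the separator as "%26" instead would merge both parameters into the single value "alice&realm=a&b", which is a different URL.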

> 2) Also, should 7.3.3 specify that, as with the key/value data pairs, these
> values be encoded in UTF-8? Requiring UTF-8 would free RP code from having
> to understand different HTML character sets, and would allow users to encode
> their HTML delivery pages in the charset of their choosing. As it stands, 
> it appears that the HTML document containing the LINK tags could be encoded 
> in any charset, with the RP responsible for decoding. With the existence 
> of "internationalized" domain names, it's quite possible that the provider 
> and local_id values will contain non-ASCII characters. Specifying UTF-8 
> encoding for HTML discovery will allow leaner, more reliable RP code.

The value of the href attribute of an HTML link is a URI, and URIs do 
not contain non-ASCII characters by definition.
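
As a rough sketch of what that means for a page author, the following Python maps an address containing non-ASCII characters to a pure-ASCII URI before it is placed in the href. The function name to_ascii_uri and the sample address are hypothetical. It assumes an absolute URL with a host, UTF-8 for the percent-encoded parts, and Python's built-in "idna" codec (which implements IDNA 2003) for the host name; it ignores userinfo and is not a complete IRI-to-URI conversion.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def to_ascii_uri(address: str) -> str:
    """Return an ASCII-only URI for an absolute address (illustrative sketch)."""
    parts = urlsplit(address)
    # Host: internationalized labels become their "xn--" (Punycode) form.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Path, query, fragment: non-ASCII characters are percent-encoded as
    # UTF-8 bytes. "%" is kept safe so existing escapes are not re-encoded.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))

print(to_ascii_uri("https://bücher.example/café"))
# https://xn--bcher-kva.example/caf%C3%A9
```

The result contains only ASCII, so it can be written into an HTML document in any character encoding without the RP needing to know that encoding to read the URL.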

Best regards, Julian



