HTML discovery: SGML entities and charsets

Peter Watkins peterw at tux.org
Fri May 18 20:35:10 UTC 2007


Section 7.3.3 in draft 11 says:

The "openid2.provider" and "openid2.local_id" URLs MUST NOT include entities other than "&amp;", "&lt;", "&gt;", and "&quot;". Other characters that would not be valid in the HTML document or that cannot be represented in the document's character encoding MUST be escaped using the percent-encoding (%xx) mechanism described in [RFC3986] (Berners-Lee, T., "Uniform Resource Identifiers (URI): Generic Syntax," .).

Questions:

1) Why are the characters &, <, >, and " allowed to be represented with those
SGML entities? Why not require them to be encoded per RFC 3986 as %26, %3C,
%3E, and %22? 
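
To make the contrast in question 1 concrete, here is a rough sketch of the two representations. It is not taken from the draft: the names decode_href and percent_encode and the op.example.com URL are hypothetical, and rejecting other entities outright is just one possible way an RP might treat the MUST NOT.

```python
import re
from urllib.parse import quote

# The four entities that 7.3.3 permits, mapped to the characters they stand for.
ALLOWED_ENTITIES = {"&amp;": "&", "&lt;": "<", "&gt;": ">", "&quot;": '"'}

# Matches anything shaped like an SGML entity, named or numeric.
ENTITY_RE = re.compile(r"&[#\w]+;")


def decode_href(value: str) -> str:
    """Decode the four permitted entities; raise on any other entity.

    Assumption: the draft does not say what an RP should do with a
    forbidden entity, so raising here is an illustrative choice only.
    """
    def replace(match: re.Match) -> str:
        entity = match.group(0)
        if entity not in ALLOWED_ENTITIES:
            raise ValueError(f"entity not permitted: {entity}")
        return ALLOWED_ENTITIES[entity]

    return ENTITY_RE.sub(replace, value)


def percent_encode(ch: str) -> str:
    """Percent-encode a character per RFC 3986, leaving nothing 'safe'."""
    return quote(ch, safe="")


# Entity form, as it would appear in a LINK tag's href attribute.
print(decode_href("https://op.example.com/auth?a=1&amp;b=2"))

# Percent-encoded form of the same four characters.
print([percent_encode(c) for c in '&<>"'])
```

With percent-encoding alone, the RP would need no entity table at all, which is the simplification the question is getting at.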

2) Also, should 7.3.3 specify that, as with the key/value data pairs, these
values be encoded in UTF-8? Requiring UTF-8 would free RP code from having
to understand different HTML character sets, and would allow users to encode
their HTML delivery pages in the charset of their choosing. As it stands, 
it appears that the HTML document containing the LINK tags could be encoded 
in any charset, with the RP responsible for decoding. With the existence 
of "internationalized" domain names, it's quite possible that the provider 
and local_id values will contain non-ASCII characters. Specifying UTF-8 
encoding for HTML discovery will allow leaner, more reliable RP code.
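
As a rough illustration of the charset concern, the sketch below shows the same URL serialized under two document encodings. The URL and the helper name decode_link_value are made up for the example, and how an RP actually learns the document charset (HTTP header, META tag, and so on) is left out.

```python
from urllib.parse import quote

# Hypothetical provider URL containing a non-ASCII character.
url = "https://bücher.example/openid"

# The bytes an RP receives depend on the charset of the HTML document.
latin1_bytes = url.encode("iso-8859-1")
utf8_bytes = url.encode("utf-8")


def decode_link_value(raw: bytes, charset: str) -> str:
    """Decode a LINK href value using the document's declared charset."""
    return raw.decode(charset)


# With the right charset, both documents yield the same URL.
assert decode_link_value(latin1_bytes, "iso-8859-1") == url
assert decode_link_value(utf8_bytes, "utf-8") == url

# With the wrong charset, the RP gets a different string or an error.
garbled = decode_link_value(utf8_bytes, "iso-8859-1")
try:
    decode_link_value(latin1_bytes, "utf-8")
    wrong_charset_failed = False
except UnicodeDecodeError:
    wrong_charset_failed = True

# If UTF-8 were mandated, the percent-encoded form would be unambiguous.
encoded_char = quote("ü", safe="")
```

If the spec fixed UTF-8 for these values, the RP could skip the charset lookup and decode one way every time.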

-Peter



