HTML discovery: SGML entities and charsets
Julian Reschke
julian.reschke at gmx.de
Mon May 28 14:30:59 UTC 2007
Peter Watkins wrote:
> 7.3.3 in draft 11 says
>
> The "openid2.provider" and "openid2.local_id" URLs MUST NOT include entities other than "&amp;", "&lt;", "&gt;", and "&quot;". Other characters that would not be valid in the HTML document or that cannot be represented in the document's character encoding MUST be escaped using the percent-encoding (%xx) mechanism described in [RFC3986] (Berners-Lee, T., "Uniform Resource Identifiers (URI): Generic Syntax," .).
>
> Questions:
>
> 1) Why are the characters &, <, >, and " allowed to be represented with those
> SGML entities? Why not require them to be encoded per RFC 3986 as %26, %3C,
> %3E, and %22?
"<" and ">" are not allowed in URLs anyway. An ampersand can appear in a
URL, in which case it would have different semantics than %26.
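To illustrate the difference in semantics, here is a minimal Python sketch. It assumes an RP that first undoes the HTML-level escaping of the href value and then parses the resulting URL; the helper name query_pairs and the host op.example are hypothetical and appear in neither the draft nor the original message.

```python
from html import unescape
from urllib.parse import parse_qsl, urlsplit


def query_pairs(href_attribute_value):
    """Undo HTML escaping first, then parse the URL's query string."""
    url = unescape(href_attribute_value)  # "&amp;" becomes "&"
    return parse_qsl(urlsplit(url).query, keep_blank_values=True)


# "&amp;" is HTML-level escaping of a literal "&", which the URL then
# treats as a delimiter between two query parameters.
delimiter_case = query_pairs("https://op.example/server?a=1&amp;b=2")

# "%26" is URI-level escaping: the ampersand is data inside the value of "a".
data_case = query_pairs("https://op.example/server?a=1%26b=2")

print(delimiter_case)
print(data_case)
```

The first call yields two parameters and the second yields one, which is why the entity cannot simply be replaced by %26.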
> 2) Also, should 7.3.3 specify that, as with the key/value data pairs, these
> values be encoded in UTF-8? Requiring UTF-8 would free RP code from having
> to understand different HTML character sets, and would allow users to encode
> their HTML delivery pages in the charset of their choosing. As it stands,
> it appears that the HTML document containing the LINK tags could be encoded
> in any charset, with the RP responsible for decoding. With the existence
> of "internationalized" domain names, it's quite possible that the provider
> and local_id values will contain non-ASCII characters. Specifying UTF-8
> encoding for HTML discovery will allow leaner, more reliable RP code.
The value of the href attribute of an HTML link is a URI, and URIs do
not contain non-ASCII characters by definition.
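As a rough sketch of what that means in practice: an identifier with non-ASCII characters has to be mapped to an ASCII-only URI before it is written into the href, after which the document's charset no longer affects it. The Python below assumes the usual mapping (IDNA for the host, percent-encoded UTF-8 elsewhere) and uses Python's built-in "idna" codec. The function name iri_to_uri and the example address are hypothetical, and userinfo and fragments are ignored for brevity.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Map an identifier with non-ASCII characters to an ASCII-only URI."""
    parts = urlsplit(iri)
    # Assumption: the input has a host; host labels go through the IDNA codec.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Everything else is percent-encoded as UTF-8; existing escapes are kept.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, netloc, path, query, ""))


uri = iri_to_uri("https://bücher.example/användare")
print(uri)

# An ASCII-only URI has the same bytes in ASCII-compatible document charsets.
same_bytes = uri.encode("latin-1") == uri.encode("utf-8")
```

Because the result is pure ASCII, it can be placed in an HTML document in any ASCII-compatible charset without the RP having to decode anything.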
Best regards, Julian