HTML discovery: SGML entities and charsets

Julian Reschke julian.reschke at gmx.de
Mon May 28 14:23:26 UTC 2007


Peter Watkins wrote:
> I believe the contents of those two tags' HREF attributes should be defined
> as UTF-8 representations of the URLs, encoded per RFC 3986.

What is a "UTF-8" representation of a URL? A URL never contains 
non-ASCII characters, by definition.

> But we're not talking about "text" here, and there's no expectation that the
> RP should be able to "read" the text in the HTML document at the user's claimed
> identity. Instead of thinking of the OpenID2 values as text, think of them as
> binary data that a machine needs to read. If an internationalized Chinese URL
> is converted to UTF-8 bytes and then URI-encoded, it is then reduced to lowest-
> common-denominator text: US-ASCII. It's an easy matter for the RP to extract 
> that and convert it back to a Unicode string, and process it properly.
> 
> Consider an identity URL like http://www.färber.de/claus

That's an IRI, not a URI or URL.

> In UTF-8, "ä" is represented by bytes 0xC3 and 0xA4, so an RFC3986-encoded 
> UTF-8 representation of http://www.färber.de/claus would be
>   http://www.f%C3%A4rber.de/claus

Nope. You can't have "a umlaut" in a URI. You can have it in an IRI, in 
which case RFC3987 describes the transformation to a URI. In this case, 
the result will be different from your example, because the non-ASCII 
character appears in the host name, where IDNA (Punycode) encoding is 
used rather than percent-escaping of the UTF-8 bytes.
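
To illustrate the difference, here is a minimal Python sketch of an 
IRI-to-URI mapping along the lines of RFC3987. It is not a complete 
implementation: the function name iri_to_uri is made up for this 
example, userinfo is dropped, and Python's built-in "idna" codec 
(IDNA 2003) is assumed to be good enough for the host name.

```python
from urllib.parse import urlsplit, urlunsplit, quote

# Characters left untouched in path/query/fragment; "%" is kept so that
# already-escaped octets are not escaped a second time.
_SAFE = "/%:@!$&'()*+,;=~?"


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: map an IRI to a URI (simplified sketch)."""
    parts = urlsplit(iri)

    # Host name: use IDNA ToASCII (Punycode), not percent-escaping.
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc += f":{parts.port}"

    # Everything else: percent-escape the UTF-8 bytes of non-ASCII chars.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE)
    fragment = quote(parts.fragment, safe=_SAFE)

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://www.färber.de/claus"))
# -> http://www.xn--frber-gra.de/claus
print(iri_to_uri("http://example.org/färber"))
# -> http://example.org/f%C3%A4rber
```

The second call shows that percent-escaped UTF-8 is what you get when 
the non-ASCII character sits in the path rather than in the host name.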

> ...

Best regards, Julian


