HTML discovery: SGML entities and charsets
Julian Reschke
julian.reschke at gmx.de
Mon May 28 14:23:26 UTC 2007
Peter Watkins wrote:
> I believe the contents of those two tags' HREF attributes should be defined
> as UTF-8 representations of the URLs, encoded per RFC 3986.
What is a "UTF-8" representation of a URL? A URL never ever contains
non-ASCII characters, by definition.
> But we're not talking about "text" here, and there's no expectation that the
> RP should be able to "read" the text in the HTML document at the user's claimed
> identity. Instead of thinking of the OpenID2 values as text, think of them as
> binary data that a machine needs to read. If an internationalized Chinese URL
> is converted to UTF-8 bytes and then URI-encoded, it is then reduced to lowest-
> common-denominator text: US-ASCII. It's an easy matter for the RP to extract
> that and convert it back to a Unicode string, and process it properly.
>
> Consider an identity URL like http://www.färber.de/claus
That's an IRI, not a URI or URL.
> In UTF-8, "ä" is represented by bytes 0xC3 and 0xA4, so an RFC 3986-encoded
> UTF-8 representation of http://www.färber.de/claus would be
> http://www.f%C3%A4rber.de/claus
Nope. You can't have "a umlaut" in a URI. You can have it in an IRI, in
which case RFC 3987 describes the transformation to a URI. In this case,
the result will be different from your example, as the non-ASCII
character appears in the host name, for which different escaping rules
apply.
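
To illustrate the difference, here is a rough Python sketch of such an
IRI-to-URI mapping. It is not a complete RFC 3987 implementation: the
function name iri_to_uri is made up for this example, the host is
converted with Python's built-in "idna" codec (IDNA ToASCII, which is my
reading of what RFC 3987 allows for host names in schemes like http),
userinfo is ignored, and the remaining components are percent-encoded
from their UTF-8 bytes.

```python
from urllib.parse import urlsplit, urlunsplit, quote

# Characters left untouched in path/query/fragment: RFC 3986 reserved
# characters plus "%" so that existing escapes are not encoded twice.
# (Unreserved characters are always left alone by quote().)
SAFE = "/?%:@!$&'()*+,;="


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: map an http-style IRI to a URI (sketch only)."""
    parts = urlsplit(iri)

    # Host name: IDNA ToASCII (Punycode labels), not percent-encoding.
    # Assumption: the IRI has a host and no userinfo.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"

    # Everything else: UTF-8 bytes, then percent-encoding.
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE)
    fragment = quote(parts.fragment, safe=SAFE)

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


# Non-ASCII in the host name becomes a Punycode label:
print(iri_to_uri("http://www.färber.de/claus"))
# Non-ASCII in the path becomes percent-encoded UTF-8:
print(iri_to_uri("http://example.org/färber"))
```

With the identity URL from above this yields
http://www.xn--frber-gra.de/claus, whereas the %C3%A4 form only shows up
when the "ä" sits in the path, as in the second call.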
> ...
Best regards, Julian