HTML discovery: SGML entities and charsets

Peter Watkins peterw at tux.org
Wed May 23 23:17:38 UTC 2007


On Mon, May 21, 2007 at 11:50:32AM -0700, Josh Hoyt wrote:

> On 5/20/07, Claus Färber <GMANE at faerber.muc.de> wrote:
> > Peter Watkins schrieb:
> > > 7.3.3 in draft 11 says
> > >
> > > The "openid2.provider" and "openid2.local_id" URLs MUST NOT include entities other than "&amp;", "&lt;", "&gt;", and "&quot;". Other characters that would not be valid in the HTML document or that cannot be represented in the document's character encoding MUST be escaped using the percent-encoding (%xx) mechanism described in [RFC3986] (Berners-Lee, T., "Uniform Resource Identifiers (URI): Generic Syntax").
> >
> > Please note that the draft is completely broken here:
> 
> Can you suggest improvements and examples or test cases of how you
> think it should work?
> 
> There has been a little discussion in the past about the restriction
> on allowed character entity references. I don't think there has been
> any about numeric character references, except in lumping them in with
> character entity references.
> 
> These restrictions live on from the OpenID 1 specification, and were
> preserved primarily to ease backwards compatibility (IIRC).

I don't think it's reasonable to expect RP code to be capable of parsing
every possible charset in which an HTML page might be encoded.

I also don't think it's reasonable to specify particular charsets that RPs
must be able to decode, and then require OpenID users to use those charsets
in their web pages just so RPs can parse these two <link> elements.

I believe the contents of those two tags' HREF attributes should be defined
as UTF-8 representations of the URLs, encoded per RFC 3986.

As Claus has pointed out, this is NOT a normal way of embedding *text*
within an SGML/HTML document. That's true. Normally the HTML document would
contain text either as the appropriate bytes for the page's charset or as SGML
entity references. That generally works for HTML because HTML is
designed to be read by humans. If my browser doesn't understand the "big5"
charset used for Chinese text, that's normally OK because I cannot read
Chinese. A link in a big5 HTML document to an internationalized URL
may not be decipherable by my web browser, and that's normally OK because
an internationalized Chinese URL in a Chinese-language document probably
isn't something I could read anyway. HTML is designed for human communication.

But we're not talking about "text" here, and there's no expectation that the
RP should be able to "read" the text in the HTML document at the user's claimed
identity URL. Instead of thinking of the OpenID2 values as text, think of them as
binary data that a machine needs to read. If an internationalized Chinese URL
is converted to UTF-8 bytes and then URI-encoded, it is reduced to lowest-
common-denominator text: US-ASCII. It's then an easy matter for the RP to
extract that value, convert it back to a Unicode string, and process it properly.
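
To make that concrete, here's a minimal sketch, in Python and with purely
hypothetical page content, of an RP pulling the openid2.provider value out
of a page as plain US-ASCII text (decoding it back to Unicode is shown
further below):

  from html.parser import HTMLParser

  PAGE = ('<html><head>'
          '<link rel="openid2.provider" '
          'href="http://www.f%C3%A4rber.de/server">'
          '</head><body></body></html>')

  class LinkFinder(HTMLParser):
      """Collects the href of the openid2.provider <link> element."""
      def __init__(self):
          super().__init__()
          self.provider = None

      def handle_starttag(self, tag, attrs):
          attrs = dict(attrs)
          if tag == "link" and attrs.get("rel") == "openid2.provider":
              self.provider = attrs.get("href")

  finder = LinkFinder()
  finder.feed(PAGE)
  print(finder.provider)  # http://www.f%C3%A4rber.de/server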

Consider an identity URL like http://www.färber.de/claus

In UTF-8, "ä" is represented by the bytes 0xC3 0xA4, so an RFC 3986-encoded
UTF-8 representation of http://www.färber.de/claus would be
  http://www.f%C3%A4rber.de/claus
If the OpenID 2.0 spec made it clear that the value of these HTML discovery
attributes was to be decoded by
 1st: applying RFC3986 decoding to convert %NN values to bytes
 2nd: interpreting as UTF-8
then the string "http://www.f%C3%A4rber.de/claus" is not ambiguous at all.
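
A minimal sketch of that two-step decode, assuming Python's standard library:

  from urllib.parse import unquote_to_bytes

  href = "http://www.f%C3%A4rber.de/claus"
  raw = unquote_to_bytes(href)   # 1st: %NN escapes -> raw bytes
  url = raw.decode("utf-8")      # 2nd: raw bytes -> Unicode string
  print(url)                     # http://www.färber.de/claus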

This is a compromise: it makes decoding simpler for RPs while still allowing
common ASCII-only URLs like "https://www.faerber.de/claus" to appear as simple
"straight" URLs.
It also seems to be in accord with the W3C's stance on Internationalized
Resource Identifiers: http://www.w3.org/International/O-URL-and-ident.html

"URIs

"Internationalization of URIs is important because URIs may contain all kinds 
of information from all kinds of protocols or formats that use characters 
beyond ASCII. The URI syntax defined in RFC 2396 currently only allows a 
subset of ASCII, about 60 characters. It also defines a way to encode 
arbitrary bytes into URI characters: a % followed by two hexadecimal digits 
(%HH-escaping). However, for historical reasons, it does not define how 
arbitrary characters are encoded into bytes before using %HH-escaping.

"Among various solutions discussed a few years ago, the use of UTF-8 as 
the preferred character encoding for URIs was judged best. This is in line 
with the IRI-to-URI conversion, which uses encoding as UTF-8 and then 
escaping with %hh:"

As for Claus' HTML editing software dilemma, I retract my comment about the
SGML entities enumerated in 7.3.3 of draft 11. 

My concrete suggestion: replace the current language

Other characters that would not be valid in the HTML document or that cannot 
be represented in the document's character encoding MUST be escaped using 
the percent-encoding (%xx) mechanism described in [RFC3986].

with this:

Any character in the href attributes MAY be represented as UTF-8 data escaped 
using the percent-encoding (%xx) mechanism described in [RFC3986]. Characters 
with Unicode values greater than U+007E MUST be represented as UTF-8 data 
escaped using the percent-encoding (%xx) mechanism described in [RFC3986]. 
For instance, the character "ä" (umlaut a, Unicode U+00E4) MUST be represented 
as the six-character string "%C3%A4", as suggested by RFC 2718.
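
And a sketch, again in Python, of an encoder that follows that proposed
wording; the function name is mine, just an illustration, not anything from
the spec:

  def encode_openid_href(url):
      # Characters above U+007E become percent-encoded UTF-8;
      # ASCII characters pass through untouched.
      pieces = []
      for ch in url:
          if ord(ch) > 0x7E:
              pieces.extend("%%%02X" % b for b in ch.encode("utf-8"))
          else:
              pieces.append(ch)
      return "".join(pieces)

  print(encode_openid_href("http://www.färber.de/claus"))
  # prints: http://www.f%C3%A4rber.de/claus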

-Peter
