HTML discovery: SGML entities and charsets

Claus Färber GMANE at faerber.muc.de
Mon May 28 15:28:47 UTC 2007


Peter Watkins schrieb:
> I don't think it's reasonable to expect RP code to be capable of parsing
> every possible charset in which an HTML page might be encoded.
> 
> I also don't think it's reasonable to specify specific charsets that RPs
> should be able to decode and then require OpenID users to use those charsets
> in their web pages just so RPs can parse these two <link> elements.
> 
> I believe the contents of those two tags' HREF attributes should be defined
> as UTF-8 representations of the URLs, encoded per RFC 3986.

URIs are always confined to a small set of characters, roughly a
subset of the ASCII repertoire. Note that these are characters (which
may be represented in ASCII, UTF-32, EBCDIC, ink on paper, etc.), not
bytes or coded characters.

Non-ASCII characters (or special characters) are not a concern when
embedding finished URIs in HTML documents. They are only a concern when
making the URIs (see below). Percent-encoding is only a concern when
making the URIs.

Actually, just two characters may need to be encoded when a URI is
embedded in an HTML document: "&" and "'" (the latter only if the
attribute is for some inexplicable reason using "'" as quotes).
Only '&' has a named entity: '&amp;'. All the others defined in HTML are
either above U+007E or specials not allowed within URIs.
However, any other character *may* be encoded. For example, '@' might
be encoded as '&#x40;' and 'A' might be encoded as '&#65;'.
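
Decoding these references is available off the shelf in many
languages; a small sketch using Python's html module (the example
URLs here are made up for illustration):

```python
import html

# Named entity reference: '&amp;' decodes to '&'.
assert html.unescape("http://example.com/?a=1&amp;b=2") == "http://example.com/?a=1&b=2"

# Hexadecimal and decimal numeric character references.
assert html.unescape("http://example.com/&#x40;user") == "http://example.com/@user"
assert html.unescape("http://example.com/&#65;") == "http://example.com/A"
```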

Actually, handling different legacy charsets is very easy: if the
document uses an extended ASCII charset, just don't try to interpret
any byte with the 8th bit set.
That does not work with UTF-16, UTF-32, ISO 2022 and EBCDIC, however.
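
The 8th-bit trick can be sketched in a few lines of Python (a toy
illustration, not a full parser; the function name is mine). It works
as long as the charset never reuses the critical ASCII bytes inside
multibyte sequences:

```python
def ascii_view(data: bytes) -> str:
    """Render raw document bytes as text, keeping only the ASCII part.

    Any byte with the 8th bit set is replaced by U+FFFD and never
    interpreted; the <link> markup itself is pure ASCII, so it
    survives intact for any extended-ASCII charset (UTF-8,
    ISO 8859-X, ...)."""
    return "".join(chr(b) if b < 0x80 else "\ufffd" for b in data)
```

For example, the ISO 8859-1 bytes for "färber" come out as
"f\ufffdrber", while pure-ASCII markup passes through unchanged.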

So the two questions to answer here are:

. What charsets does an RP need to be able to handle?
   - extended ASCII (including UTF-8, ISO 8859, GB 18030)
   - UTF-16 (including endian detection)
   - UTF-32?
   - ISO 2022 (a switching charset that might fool ASCII parsers;
       however, any sequence outside the ASCII plane can be ignored,
       just like 8-bit bytes in extended ASCII)?
   - EBCDIC?
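
For the BOM-carrying charsets in the list, endian detection can be
sketched like this (Python, BOM-based sniffing only; real charset
detection must also consult the HTTP headers and <meta> elements, and
the function name is mine):

```python
import codecs

def sniff_charset(data: bytes) -> str:
    # Check the UTF-32 BOMs first: the UTF-32LE BOM begins with the
    # same bytes (FF FE) as the UTF-16LE BOM, so order matters here.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Anything else is treated as some extended-ASCII charset; the
    # high-bit bytes are simply never interpreted.
    return "ascii-superset"
```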

. What character references does an RP need to handle?
   - entity references (i.e. '&amp;')
   - numeric character references ('&#xNN;' and '&#NNN;')

> A link in a big5 HTML document to an internationalized URL 
> may not be decipherable by my web browser, and that's normally OK because
> an internationalized Chinese URL in a Chinese-language document is probably 
> nothing I could read, anyway. HTML is designed for human communication.

Well, if we're talking about IRIs (Internationalised Resource
Identifiers), that's a completely different story.

Like URIs, they are made of characters. However, these characters may
now be above U+007E.
When embedding them in HTML, there are a lot of additional named
entity references to handle.
Further, you can't get away with just treating extended ASCII as ASCII.

URIs can be mapped to IRIs by undoing the percent-encoding for bytes
that are valid UTF-8 sequences and interpreting the result as UTF-8.

For example, <http://example.com/f%C3%A4rber> can be mapped to
<http://example.com/färber>.
However, <http://example.com/f%E4rber> cannot be mapped to an IRI (i.e.
the IRI is just identical to the URI).
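
This URI-to-IRI mapping can be sketched as follows (Python; a
conservative approximation of the procedure in RFC 3987 section 3.2,
function name mine):

```python
import re

def uri_to_iri(uri: str) -> str:
    """Decode percent-escape runs that form valid non-ASCII UTF-8.

    Sketch only: runs that are not valid UTF-8, or that decode to
    ASCII (reserved characters like %2F must stay escaped), are left
    untouched."""
    def decode_run(m):
        raw = bytes(int(h, 16) for h in re.findall(r"%([0-9A-Fa-f]{2})", m.group(0)))
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            return m.group(0)          # not valid UTF-8: leave as-is
        if any(ord(c) < 0x80 for c in text):
            return m.group(0)          # keep ASCII/reserved bytes escaped
        return text
    return re.sub(r"(?:%[0-9A-Fa-f]{2})+", decode_run, uri)
```

With this, "f%C3%A4rber" becomes "färber", while "f%E4rber" (not valid
UTF-8) and "a%2Fb" (reserved slash) are returned unchanged.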

Currently, the HTML 4.01 spec does not formally allow IRIs. However, the
HTML 5 draft does.

With all of this, the real question here is:

. Should support for IRIs be required?

If IRIs are allowed, the number of charsets and named entity references
an RP must be able to handle is much larger. So if yes, the same questions
as above come up again:

. What charsets does an RP need to be able to handle?
   - ISO 8859-X, Windows-1252?
   - UTF-8
   - GB 18030
   - EUC
   - UTF-16 (including endian detection)
   - UTF-32?
   - ...

. What character references does an RP need to handle?
   - entity references (full HTML list)
   - numeric character references ('&#xNN;' and '&#NNN;')

> Instead of thinking of the OpenID2 values as text, think of them as
> binary data that a machine needs to read. If an internationalized Chinese URL
> is converted to UTF-8 bytes and then URI-encoded, it is then reduced to lowest-
> common-denominator text: US-ASCII.

That's basically what URIs already do. No need to reinvent the wheel.

> Consider an identity URL like http://www.färber.de/claus
> 
> In UTF-8, "ä" is represented by bytes 0xC3 and 0xA4, so a RFC3986 encoded 
> UTF-8 representation of http://www.färber.de/claus would be
>   http://www.f%C3%A4rber.de/claus

Or just <http://www.xn--frber-gra.de/claus>, which also works with
software that can't handle IDNs at all.
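
For reference, this Punycode/IDNA mapping can be checked with Python's
built-in idna codec (which implements IDNA 2003; modern registries use
IDNA 2008, but the result is the same for this name):

```python
# ToASCII/ToUnicode are applied per label; "de" is already ASCII.
assert "färber.de".encode("idna") == b"xn--frber-gra.de"
assert b"xn--frber-gra.de".decode("idna") == "färber.de"
```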

It does not work like that with the path component of HTTP URIs,
however. <http://example.com/f%E4rber> (using ISO 8859-1),
<http://example.com/f%7Brber> (ISO 646 DE) and
<http://example.com/f%C3%A4rber> (UTF-8) are all valid URIs.

As a general rule, URIs contain bytes (possibly percent-encoded), not
characters. The mapping between these bytes and characters can be made
by the URI specification (e.g. domain names), by the server that hosts
the resource (e.g. a Windows HTTP server[1]) or even not at all (e.g.
data URIs, a POSIX HTTP server[2]).

Well, that said, the question is:

. Should support for IDNA be required?

Note that this question is still valid if IRIs are not allowed: You can
still write the URI <http://f%C3%A4rber.de> if you can't write the IRI
<http://färber.de>.

> My concrete suggestion: replace the current language
> 
> Other characters that would not be valid in the HTML document or that cannot 
> be represented in the document's character encoding MUST be escaped using 
> the percent-encoding (%xx) mechanism described in [RFC3986].
> 
> with this:
> 
> Any character in the href attributes MAY be represented as UTF-8 data escaped 
> using the percent-encoding (%xx) mechanism described in [RFC3986]. 

No! Definitely not. That's still a layer violation.

If I have URI http://example.com/f%E4rber, that's a legal URI and the
OpenID spec has no business telling me what URIs I may have.
Further, it would allow encoding http://example.com/foo?bar=1 as
http://example.com/foo%3Fbar%3D1 ("any character"), which is just wrong.

Unfortunately, there's a dilemma: be too strict in the OpenID spec and
some HTML tools will break the links; be too lenient and some RPs can do
HTML discovery where others can't. Bad user experience is the result.

Well, here's my suggestion:
----------------------------------------------------------------------
The HTML document MUST use a charset that is a superset of US-ASCII for
bytes in the ranges 0x09-0x0D, 0x20-0x27, 0x2C-0x5F, 0x61-0x7A, and
0x7E, and that does not use the bytes 0x22, 0x26, 0x27, 0x3C within
multibyte sequences; or one of the following charsets: UTF-16, UTF-16BE,
UTF-16LE, UTF-32, UTF-32BE, or UTF-32LE. Relying Parties MUST support
these charsets.

The "openid2.provider" and "openid2.local_id" URLs MUST NOT be IRIs
[<insert reference>] unless the HTML document is in one of the
following charsets: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE,
or UTF-32LE. Relying Parties MUST support IRIs in documents with these
charsets.

Relying Parties MUST parse HTML [<insert reference>] correctly and they
MUST decode all entity references and numeric character references
(hexadecimal and decimal) correctly. They MUST support internationalized
domain names [<insert reference>] within URIs and IRIs.
----------------------------------------------------------------------
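
The byte-range requirement above can be checked mechanically against a
codec's tables. A rough brute-force sketch in Python (function name
mine; note it cannot detect critical bytes reused *inside* multibyte
sequences, e.g. GB 18030's four-byte forms):

```python
def ascii_superset_ok(encoding: str) -> bool:
    """Check that an encoding maps the required ASCII bytes to
    themselves, per the ranges in the proposed rule above."""
    required = (list(range(0x09, 0x0E)) + list(range(0x20, 0x28))
                + list(range(0x2C, 0x60)) + list(range(0x61, 0x7B))
                + [0x7E])
    for b in required:
        try:
            if bytes([b]).decode(encoding) != chr(b):
                return False
        except UnicodeDecodeError:
            return False    # e.g. UTF-16: a lone byte never decodes
    return True
```

UTF-8 and the ISO 8859 family pass; UTF-16 fails (as expected, since
it falls under the separate BOM-charset clause).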
Yes, I know that this lays a burden on the implementers of Relying
Parties. However, I think that avoiding compatibility problems, which
might contribute to a bad reputation for OpenID, is worth it.

After all, there are only a few reference implementations (libraries)
used by most sites.

[N.B.: I'm reposting this message because the first version seems not to 
have made it through Gmane and/or the list. I used that opportunity to 
fix some typos.]

Claus
____________________
[1] Windows' NTFS uses Unicode for file names.
[2] POSIX-compatible systems just use uninterpreted bytes as file names.
     Whatever bytes the user uses for the file name is found in the URI.



