HTML discovery: SGML entities and charsets

Claus Färber GMANE at faerber.muc.de
Mon May 28 09:45:08 UTC 2007


Josh Hoyt wrote:
> There has been a little discussion in the past about the restriction
> on allowed character entity references. I don't think there has been
> any about numeric character references, except in lumping them in with
> character entity references.
> 
> These restrictions live on from the OpenID 1 specification, and were
> preserved primarily to ease backwards compatibility (IIRC).

It seems that it has been taken from the pingback specification:
http://www.hixie.ch/specs/pingback/pingback#TOC2.2

The rationale given is that it should not be necessary to implement a
full HTML parser. Unfortunately, this claim is completely bogus: As
HTML has a context-sensitive grammar, you just can't parse it with
regular expressions.

If you try, you will inevitably write a parser that fails on some HTML
constructs users might expect to work. (For example: comments. It is
nearly unimaginable, but users might even try to put comment markers
around an OpenID link, add another OpenID link and expect RPs to use the
one not within a comment.)
Others will also try and will inevitably write a parser that fails on
some _different_ HTML constructs.
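
To make the comment case concrete, here is a minimal sketch of what a
regex-only implementation might look like. The page, the URLs and the
pattern are all made up for illustration; no actual RP is known to use
this exact pattern.

```python
import re

# Hypothetical page: the first OpenID link is commented out and the
# second one is the link the user actually wants RPs to use.
PAGE = """<html><head>
<!-- <link rel="openid.server" href="http://old.example.com/server"> -->
<link rel="openid.server" href="http://new.example.com/server">
</head><body></body></html>"""

# A naive pattern of the kind a regex-only implementation might use
# (an assumption for illustration, not taken from any real RP).
LINK_RE = re.compile(
    r'<link\s+rel="openid\.server"\s+href="([^"]*)"', re.IGNORECASE)


def naive_find_server(html):
    """Return the first href matched by the regex, or None."""
    match = LINK_RE.search(html)
    return match.group(1) if match else None


# The regex knows nothing about comments, so it returns the
# commented-out link instead of the live one.
print(naive_find_server(PAGE))
```

A different naive pattern (say, one that tolerates another attribute
order) would fail on a different set of pages, which is exactly the
interoperability problem.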

The result is that one RP might work with a URL (because it can handle
comments within the HTML) while another one does not. Without looking at
the code of the RP's HTML parser, it is nearly impossible for the user
to tell why some RPs fail.
If that isn't extremely bad user experience, what is?

(As a side note: There's no telling whether there's a security risk with
some RPs, either.)

The only way around this is using a real HTML parser. If you do, there's
no reason not to parse and handle all character references.
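
As a rough sketch of that approach, the following uses Python's standard
html.parser module. It is a tolerant tokenizer rather than a complete
HTML parser, and it is only one possible choice; the class name, the
helper function and the sample page are invented for this example. It
also does not check that the link sits inside the HEAD element.

```python
from html.parser import HTMLParser


class OpenIDLinkFinder(HTMLParser):
    """Collect href values of <link rel="openid.server"> elements.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.servers = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attribute values arrive with
        # character references already decoded by the parser.
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "openid.server" in rels and href:
            self.servers.append(href)


def find_servers(html):
    """Return all openid.server hrefs found outside of comments."""
    finder = OpenIDLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.servers


# Made-up page: one commented-out link, one live link whose URL uses
# both a named entity and a numeric character reference.
PAGE = """<html><head>
<!-- <link rel="openid.server" href="http://old.example.com/server"> -->
<link rel="openid.server"
      href="http://new.example.com/server?a=1&amp;b=2&#x26;c=3">
</head><body></body></html>"""

print(find_servers(PAGE))
```

The commented-out link goes to the parser's comment handler and never
reaches handle_starttag, and both kinds of character reference in the
href are decoded without any extra code.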

Claus



