HTML parsing in HTML-based discovery (was: DRAFT 11 -> FINAL?)
gmane at faerber.muc.de
Fri Jan 26 12:21:16 UTC 2007
Martin Atkins wrote:
> Since your list is long, I'm only going to address things I have an
>> | 7.3.3. HTML-Based Discovery
> In practice, few implementations actually use an HTML parser to find
> these elements. These extra restrictions are present to facilitate
> regex-based parsing.
Yes, and this is the problem. Implementors may *think* that they can get
away with regexp parsing when in fact they can't. HTML/XHTML requires a
context-free parser, which is one level above regular expressions in the
Chomsky hierarchy.
Even if they start mixing regexps and other code, it is likely that they
won't handle all the corner-cases of HTML correctly.
The effect is that an OpenID login may work on 80% of all sites ... and
not on the other 20% that use a different parser. And the user will not
even know _why_ his login fails. After all, validators and other HTML
checking tools will tell him that his site is valid HTML.
It's even possible that some parsers fail on things other parsers require.
> The regex-based parsers employed by existing implementations require
> explicit <head> start and end tags. I agree that this is not ideal, but
> it's hardly an onerous requirement on document authors.
Currently, the spec does not require explicit start and end tags for the
HEAD element. It talks about a "HEAD section", which is always there
even if it is not marked (see
This is already an incompatibility caused by unclear wording.
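To illustrate the difference, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the pattern EXPLICIT_HEAD, the HeadLinks class, the head_links helper, the example URL and the rel value "openid2.provider" are not taken from any existing implementation. The sample document omits the optional HEAD tags, so a pattern that demands explicit tags finds no HEAD at all, while a parser-based approach can treat everything before the first non-HEAD element as the HEAD section. The element list is only a rough approximation of the HTML 4.01 rules.

```python
import re
from html.parser import HTMLParser

# Hypothetical regexp that insists on explicit <head> ... </head> tags.
EXPLICIT_HEAD = re.compile(r"<head[^>]*>(.*?)</head>", re.IGNORECASE | re.DOTALL)

# Rough approximation of the elements HTML 4.01 allows in HEAD,
# plus the (optional) html and head wrappers themselves.
HEAD_LEVEL = {"html", "head", "title", "base", "link", "meta",
              "script", "style", "object"}


class HeadLinks(HTMLParser):
    """Collect <link> attributes up to the first element that cannot be in HEAD."""

    def __init__(self):
        super().__init__()
        self.in_head = True
        self.links = []

    def handle_starttag(self, tag, attrs):
        if not self.in_head:
            return
        if tag not in HEAD_LEVEL:
            # e.g. <body> or <p>: the (possibly implied) HEAD has ended.
            self.in_head = False
        elif tag == "link":
            self.links.append(dict(attrs))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def head_links(html):
    parser = HeadLinks()
    parser.feed(html)
    parser.close()
    return parser.links


# A document without explicit html, head or body tags.
DOC = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">\n'
    "<title>Home</title>\n"
    '<link rel="openid2.provider" href="https://op.example/server">\n'
    "<p>Hello"
)

print(EXPLICIT_HEAD.search(DOC))
print(head_links(DOC))
```

The sketch ignores text nodes and the content model of OBJECT, so it is not a substitute for a real HTML parser; it only shows that "HEAD section" and "text between explicit head tags" are different things.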
In order to facilitate regexp parsing, just requiring the start and end
tags is not enough. Additional restrictions may also be necessary to
avoid cases where overly simple regexp-based parsers might fail:
- <head> start with attributes.
- order of attributes within the <LINK> tag.
- single quotes vs. double quotes vs. no quotes.
- unescaped "<"/">" within attributes.*
- numeric character references.*
- line feeds within tags.*
- additional XML namespaces that allow attributes like foo:href.*
- <LINK> tags within <!-- comments -->.*
- [to be continued]
(* = inspired by a real-world implementation failing to handle these
If you want to handle all of these correctly, you already need a true
The fewer of these restrictions are added, the less likely it is that
regexp-based parsers will interoperate.
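Some of the corner cases listed above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the pattern NAIVE_LINK is a deliberately simplistic, hypothetical regexp (not the code of any real implementation), the rel value "openid2.provider" and the example URLs are assumed for illustration, and LinkFinder does not restrict itself to the HEAD section. It compares the regexp with Python's standard html.parser module on attribute order and quoting, line feeds within tags, numeric character references, and a LINK tag inside a comment.

```python
import re
from html.parser import HTMLParser

# Hypothetical naive pattern: assumes rel comes first, single spaces,
# and double-quoted attribute values.
NAIVE_LINK = re.compile(r'<link rel="openid2\.provider" href="([^"]*)"',
                        re.IGNORECASE)


def naive_find(html):
    match = NAIVE_LINK.search(html)
    return match.group(1) if match else None


class LinkFinder(HTMLParser):
    """Remember the href of the first <link> whose rel contains openid2.provider."""

    def __init__(self):
        super().__init__()
        self.provider = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.provider is not None:
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        if "openid2.provider" in rels and attributes.get("href"):
            self.provider = attributes["href"]


def parser_find(html):
    finder = LinkFinder()
    finder.feed(html)
    finder.close()
    return finder.provider


SAMPLES = {
    "attribute order, single quotes":
        "<link href='https://op.example/s' rel='openid2.provider'>",
    "line feeds within the tag":
        '<link\n  rel="openid2.provider"\n  href="https://op.example/s">',
    "numeric character reference":
        '<link rel="openid2.provider" href="https://op.example/s?a=1&#38;b=2">',
    "link inside a comment":
        '<!-- <link rel="openid2.provider" href="https://old.example/"> -->'
        '<link rel="openid2.provider" href="https://op.example/s">',
}

for name, html in SAMPLES.items():
    print(f"{name}: regexp={naive_find(html)!r} parser={parser_find(html)!r}")
```

In the first two samples the regexp finds nothing; in the last two it returns a value that differs from what the parser reports. Tightening the pattern fixes individual cases, but each fix is another ad-hoc rule that a different implementation may or may not share.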
> This is mostly an ideological argument founded on whether we're allowed
> to impose additional restrictions on HTML documents that are making use
> of OpenID discovery. There is certainly no *practical* reason why this
> shouldn't be done, assuming that the restrictions are sufficient to
> prevent the above attack.
There are practical problems:
* Users can't use existing HTML tools to check for the additional
restrictions. A validator will say "valid HTML" but the OpenID
login fails due to a "parsing error" (e.g. the PHP implementation used
on OpenID Enabled). And different RPs will choke on different things.
* Users can't use existing HTML tools that do not honor the additional
restrictions. An HTML pretty-printer may simply re-format the code in
a way unparsable by ad-hoc parsers; a hypothetical htmlcrush program
might remove the optional quotes, entity references and tags in
* Other specs might also impose restrictions, which can be incompatible
with OpenID's restrictions.
The more restrictions are added, the more likely it is that these
practical problems will arise.