HTML parsing in HTML-based discovery (was: DRAFT 11 -> FINAL?)

Claus Färber gmane at faerber.muc.de
Fri Jan 26 12:21:16 UTC 2007


Martin Atkins wrote:
> Since your list is long, I'm only going to address things I have an 
>> | 7.3.3.  HTML-Based Discovery
> In practice, few implementations actually use an HTML parser to find 
> these elements. These extra restrictions are present to facilitate 
> regex-based parsing.

Yes, and this is the problem. Implementors may *think* that they can get 
away with regexp parsing when in fact they can't. HTML/XHTML requires a 
context-free parser, which is one level above regular expressions in the 
Chomsky hierarchy.

Even if they start mixing regexps and other code, it is likely that they 
won't handle all the corner-cases of HTML correctly.
The effect is that an OpenID login may work on 80% of all sites ... and 
not on the other 20% that use a different parser. And the user will not 
even know _why_ his login fails. After all, validators and other HTML 
checking tools will tell him that his site is valid HTML.
It's even possible that some parsers fail on things other parsers require.

> The regex-based parsers employed by existing implementations require 
> explicit <head> start and end tags. I agree that this is not ideal, but 
> it's hardly an onerous requirement on document authors.

Currently, the spec does not require explicit start and end tags for the 
HEAD element. It talks about a "HEAD section", which is always there 
even if it is not marked (see 
<http://www.w3.org/TR/html401/intro/sgmltut.html#h-3.2.1>).
This is already an incompatibility caused by unclear wording.

In order to facilitate regexp parsing, just requiring the start and end 
tags is not enough. Additional restrictions may also be necessary to 
avoid cases where overly simple regexp-based parsers might fail:

- <head> start tag with attributes.
- order of attributes within the <LINK> tag.
- single quotes vs. double quotes vs. no quotes.
- unescaped "<"/">" within attributes.*
- numeric character references.*
- line feeds within tags.*
- additional XML namespaces that allow attributes like foo:href.*
- <LINK> tags within <!-- comments -->.*
- [to be continued]

(* = inspired by a real-world implementation failing to handle these 
cases correctly)

If you want to handle all of these correctly, you already need a true 
HTML parser.
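
To illustrate a few of the cases above, here is a minimal Python sketch.
The pattern NAIVE is a hypothetical ad-hoc regexp, not taken from any
real implementation, and the rel value "openid2.provider" as well as
the example.com URLs are assumptions made for the example. The sketch
ignores the HEAD restriction entirely; it only contrasts the regexp
with the standard library's html.parser on three small documents.

```python
import re
from html.parser import HTMLParser

REL = "openid2.provider"  # assumed rel value, for illustration only

# Hypothetical ad-hoc pattern: fixed attribute order, double quotes only.
NAIVE = re.compile(r'<link rel="%s" href="([^"]*)"' % re.escape(REL), re.I)


def naive_find(html):
    """Return the href the ad-hoc regexp sees, or None."""
    match = NAIVE.search(html)
    return match.group(1) if match else None


class LinkFinder(HTMLParser):
    """Record the href of the first LINK start tag with the wanted rel."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names and decodes
        # character references in attribute values before calling this.
        if tag != "link" or self.href is not None:
            return
        attributes = dict(attrs)
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if REL in rel_tokens:
            self.href = attributes.get("href")


def parser_find(html):
    """Return the href found by a real tokenizer, or None."""
    finder = LinkFinder()
    finder.feed(html)
    finder.close()
    return finder.href


# Attribute order swapped, single quotes.
SWAPPED = "<head><link href='http://example.com/op' rel='openid2.provider'></head>"
# LINK tag inside a comment.
COMMENTED = ('<head><!-- <link rel="openid2.provider" '
             'href="http://old.example.com/"> --></head>')
# Line feeds within the tag, plus an entity reference in the URL.
WRAPPED = ('<head><link\n rel="openid2.provider"\n'
           ' href="http://example.com/op?a=1&amp;b=2"></head>')

for name, doc in [("swapped", SWAPPED), ("commented", COMMENTED),
                  ("wrapped", WRAPPED)]:
    print(name, naive_find(doc), parser_find(doc))
```

The regexp misses the swapped and wrapped documents and picks up the
commented-out LINK, while the tokenizer-based version finds the first
two and skips the comment. Patching the regexp for each case in turn
is possible, but the result converges on a hand-written HTML tokenizer.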

The fewer of these restrictions are added, the less likely it is 
that regexp-based parsers will interoperate.

> This is mostly an ideological argument founded on whether we're allowed 
> to impose additional restrictions on HTML documents that are making use 
> of OpenID discovery. There is certainly no *practical* reason why this 
> shouldn't be done, assuming that the restrictions are sufficient to 
> prevent the above attack.

There are practical problems:

* Users can't use existing HTML tools to check for the additional
   restrictions. A validator will say "valid HTML" but the OpenID
   login fails due to a "parsing error" (e.g. the PHP implementation used
   on OpenID Enabled). And different RPs will choke on different things.

* Users can't use existing HTML tools that do not honor the additional
   restrictions. An HTML pretty-printer may simply re-format the code in
   a way unparsable by ad-hoc parsers; a hypothetical htmlcrush program
   might remove the optional quotes, entity references and tags in
   good faith.

* Other specs might also impose restrictions, which can be incompatible
   with OpenID's restrictions.

The more restrictions are added, the more likely it is that these 
practical problems will arise.

Claus



