HTML parsing in HTML-based discovery
mart at degeneration.co.uk
Fri Jan 26 18:09:53 UTC 2007
Claus Färber wrote:
> In order to facilitate regexp parsing, just requiring the start and end
> tags is not enough. Additional restrictions may also be necessary to
> avoid cases where too simple regexp-based parsers might fail:
> - <head> start tags with attributes.
> - order of attributes within the <LINK> tag.
> - single quotes vs. double quotes vs. no quotes.
> - unescaped "<"/">" within attributes.*
> - numeric character references.*
> - line feeds within tags.*
> - additional XML namespaces that allow attributes like foo:href.*
> - <LINK> tags within <!-- comments -->.*
> - [to be continued]
> (* = inspired by a real-world implementation failing to handle these
> cases correctly)
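These failure modes are easy to demonstrate. A minimal Python sketch of a naive regex-based extractor of the kind under discussion; the pattern and markup here are hypothetical, not taken from any real implementation:

```python
import re

# Naive extractor: look for a LINK tag with rel="openid.server" and
# pull out the href value. Assumes double quotes and attribute order.
NAIVE = re.compile(r'<link rel="openid\.server" href="([^"]*)"',
                   re.IGNORECASE)

def naive_server(html):
    m = NAIVE.search(html)
    return m.group(1) if m else None

# Straightforward markup parses fine:
ok = '<link rel="openid.server" href="https://example.com/openid">'
print(naive_server(ok))  # https://example.com/openid

# ...but several of the cases listed above defeat it:
single_quotes = "<link rel='openid.server' href='https://example.com/openid'>"
reordered = '<link href="https://example.com/openid" rel="openid.server">'
commented = '<!-- <link rel="openid.server" href="https://evil.example/"> -->'

print(naive_server(single_quotes))  # None - quote style not anticipated
print(naive_server(reordered))      # None - attribute order differs
print(naive_server(commented))      # https://evil.example/ - comment not ignored
```

The last case is the security-relevant one: a LINK inside a comment should be ignored, but the naive pattern happily matches it.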
And, in theory, the OpenID spec could add additional restrictions to
"fix" the above problems.
Whether it should or not is of course up for debate; I'd be interested
to hear from Brad Fitzpatrick and JanRain's developers who are
responsible for the most-used implementations currently using regex
parsing. Why didn't you guys use an HTML parser? I assume there must
have been a reason.
>> This is mostly an ideological argument founded on whether we're allowed
>> to impose additional restrictions on HTML documents that are making use
>> of OpenID discovery. There is certainly no *practical* reason why this
>> shouldn't be done, assuming that the restrictions are sufficient to
>> prevent the above attack.
> There are practical problems:
> * Users can't use existing HTML tools to check for the additional
> restrictions. A validator will say "valid HTML" but the OpenID
> login fails due to a "parsing error" (e.g. the PHP implementation used
> on OpenID Enabled). And different RPs will choke on different things.
An HTML validator also won't help them if they transpose the values of
openid.server and openid.delegate, or if they type rel="opnid.server"
instead. There are OpenID-specific "validation"/checking tools in the
works which will hopefully give users good information about potential
pitfalls in the way they have written their HTML, in addition to
pointing out problems such as a missing openid.server LINK.
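A checker of the sort described might, as a rough sketch, look for OpenID-specific mistakes rather than HTML validity. The rel-value list and the typo heuristic below are illustrative assumptions, not a description of any actual tool:

```python
import re

# Hypothetical OpenID-aware checker: flags misspelled rel values and a
# missing openid.server LINK, which a plain HTML validator would not.
KNOWN_RELS = {"openid.server", "openid.delegate"}
REL = re.compile(r'<link[^>]+rel=["\']([^"\']+)["\']', re.IGNORECASE)

def check_openid_links(html):
    problems = []
    rels = set(REL.findall(html))
    for rel in rels - KNOWN_RELS:
        # Crude heuristic: anything starting with "op" is probably a typo
        # for an openid.* rel value.
        if rel.lower().startswith("op"):
            problems.append("unrecognised rel value %r - typo?" % rel)
    if "openid.server" not in rels:
        problems.append("no openid.server LINK found")
    return problems

print(check_openid_links(
    '<link rel="opnid.server" href="https://example.com/">'))
```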
> * Users can't use existing HTML tools that do not honor the additional
> restrictions. A HTML pretty-printer may simply re-format the code in
> a way unparsable by ad-hoc parsers; a hypothetical htmlcrush program
> might remove the optional quotes, entity references and tags in
> good faith.
Indeed. But those documents wouldn't conform to the OpenID
specification (assuming it went into more detail about the
restrictions it adds to HTML).
I think the main point here is that whatever the outcome of this
debate, people *will* write regex-based parsers, whether the spec
allows for it or not. We have a choice between ignoring the issue, so
that all of these regex-based parsers fail in interesting ways on odd
cases, or accepting that this is inevitable and spelling out in detail
a set of rules for regex-based parsing along with a set of
restrictions on HTML that make those parsing rules possible.
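To make the second option concrete, here is a hedged sketch of what a spec-blessed regex "profile" could look like. The particular restrictions assumed here (double-quoted attributes, rel before href, one LINK per line, no LINK tags inside comments) are invented for illustration and are not taken from the spec:

```python
import re

# If the spec mandated a narrow syntactic profile for OpenID LINK tags,
# a single documented pattern could parse all conforming documents.
PROFILE = re.compile(
    r'^<link rel="(openid\.(?:server|delegate))" href="([^"]+)"\s*/?>$',
    re.IGNORECASE | re.MULTILINE,
)

conforming = (
    '<link rel="openid.server" href="https://example.com/openid">\n'
    '<link rel="openid.delegate" href="https://example.com/users/mart">\n'
)

found = {rel.lower(): href for rel, href in PROFILE.findall(conforming)}
print(found)
```

The flip side, as noted above, is that markup rejected by this profile may still be perfectly valid HTML; the profile trades generality for parseability.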
I'd love it if everyone would use proper HTML or XML parsers, but that
just isn't going to happen no matter how much we wish it would. In the
end "almost there but not quite" implementations hurt no-one but the
end-user, and OpenID is what will get the blame for any negative user
experience, not the libraries that use incompatible regex-based parsers.