HTML parsing in HTML-based discovery
mart at degeneration.co.uk
Fri Jan 26 18:09:53 UTC 2007
Claus Färber wrote:
> In order to facilitate regexp parsing, just requiring the start and end
> tags is not enough. Additional restrictions may also be necessary to
> avoid cases where too simple regexp-based parsers might fail:
> - <head> start tags with attributes.
> - order of attributes within the <LINK> tag.
> - single quotes vs. double quotes vs. no quotes.
> - unescaped "<"/">" within attributes.*
> - numeric character references.*
> - line feeds within tags.*
> - additional XML namespaces that allow attributes like foo:href.*
> - <LINK> tags within <!-- comments -->.*
> - [to be continued]
> (* = inspired by a real-world implementation failing to handle these
> cases correctly)
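These failure modes are easy to demonstrate. A minimal Python sketch of a naive regex-based extractor of the kind under discussion; the pattern and markup here are hypothetical, not taken from any real implementation:

```python
import re

# Naive extractor: look for a LINK tag with rel="openid.server" and
# pull out the href value. Assumes double quotes and attribute order.
NAIVE = re.compile(r'<link rel="openid\.server" href="([^"]*)"',
                   re.IGNORECASE)

def naive_server(html):
    m = NAIVE.search(html)
    return m.group(1) if m else None

# Straightforward markup parses fine:
ok = '<link rel="openid.server" href="https://example.com/openid">'
print(naive_server(ok))  # https://example.com/openid

# ...but several of the cases listed above defeat it:
single_quotes = "<link rel='openid.server' href='https://example.com/openid'>"
reordered = '<link href="https://example.com/openid" rel="openid.server">'
commented = '<!-- <link rel="openid.server" href="https://evil.example/"> -->'

print(naive_server(single_quotes))  # None - quote style not anticipated
print(naive_server(reordered))      # None - attribute order differs
print(naive_server(commented))      # https://evil.example/ - comment not ignored
```

The last case is the security-relevant one: a LINK inside a comment should be ignored, but the naive pattern happily matches it.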
And, in theory, the OpenID spec could add additional restrictions to
"fix" the above problems.
Whether it should or not is of course up for debate; I'd be interested
to hear from Brad Fitzpatrick and JanRain's developers who are
responsible for the most-used implementations currently using regex
parsing. Why didn't you guys use an HTML parser? I assume there must
have been a reason.
>> This is mostly an ideological argument founded on whether we're allowed
>> to impose additional restrictions on HTML documents that are making use
>> of OpenID discovery. There is certainly no *practical* reason why this
>> shouldn't be done, assuming that the restrictions are sufficient to
>> prevent the above attack.
> There are practical problems:
> * Users can't use existing HTML tools to check for the additional
> restrictions. A validator will say "valid HTML" but the OpenID
> login fails due to a "parsing error" (e.g. the PHP implementation used
> on OpenID Enabled). And different RPs will choke on different things.
An HTML validator also won't help them if they transpose the values of
openid.server and openid.delegate, or if they type rel="opnid.server"
instead. There are OpenID-specific "validation"/checking tools in the
works which will hopefully give users good information about potential
pitfalls in the way they have written their HTML, in addition to
pointing out problems such as a missing openid.server LINK.
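A checker of the sort described might, as a rough sketch, look for OpenID-specific mistakes rather than HTML validity. The rel-value list and the typo heuristic below are illustrative assumptions, not a description of any actual tool:

```python
import re

# Hypothetical OpenID-aware checker: flags misspelled rel values and a
# missing openid.server LINK, which a plain HTML validator would not.
KNOWN_RELS = {"openid.server", "openid.delegate"}
REL = re.compile(r'<link[^>]+rel=["\']([^"\']+)["\']', re.IGNORECASE)

def check_openid_links(html):
    problems = []
    rels = set(REL.findall(html))
    for rel in rels - KNOWN_RELS:
        # Crude heuristic: anything starting with "op" is probably a typo
        # for an openid.* rel value.
        if rel.lower().startswith("op"):
            problems.append("unrecognised rel value %r - typo?" % rel)
    if "openid.server" not in rels:
        problems.append("no openid.server LINK found")
    return problems

print(check_openid_links(
    '<link rel="opnid.server" href="https://example.com/">'))
```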
> * Users can't use existing HTML tools that do not honor the additional
> restrictions. A HTML pretty-printer may simply re-format the code in
> a way unparsable by ad-hoc parsers; a hypothetical htmlcrush program
> might remove the optional quotes, entity references and tags in
> good faith.
Indeed. But those documents wouldn't conform to the OpenID
specification (assuming it went into more detail about the
restrictions it adds to HTML).
I think the main point here is that whatever the outcome of this
debate, people *will* write regex-based parsers, whether the spec
allows for it or not. We have a choice between ignoring the issue, so
that all of these regex-based parsers fail in interesting ways on odd
cases, or accepting that this is inevitable and spelling out in detail
a set of rules for regex-based parsing along with a set of
restrictions on HTML that make those parsing rules possible.
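To make the second option concrete, here is a hedged sketch of what a spec-blessed regex "profile" could look like. The particular restrictions assumed here (double-quoted attributes, rel before href, one LINK per line, no LINK tags inside comments) are invented for illustration and are not taken from the spec:

```python
import re

# If the spec mandated a narrow syntactic profile for OpenID LINK tags,
# a single documented pattern could parse all conforming documents.
PROFILE = re.compile(
    r'^<link rel="(openid\.(?:server|delegate))" href="([^"]+)"\s*/?>$',
    re.IGNORECASE | re.MULTILINE,
)

conforming = (
    '<link rel="openid.server" href="https://example.com/openid">\n'
    '<link rel="openid.delegate" href="https://example.com/users/mart">\n'
)

found = {rel.lower(): href for rel, href in PROFILE.findall(conforming)}
print(found)
```

The flip side, as noted above, is that markup rejected by this profile may still be perfectly valid HTML; the profile trades generality for parseability.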
I'd love it if everyone would use proper HTML or XML parsers, but that
just isn't going to happen no matter how much we wish it would. In the
end "almost there but not quite" implementations hurt no-one but the
end-user, and OpenID is what will get the blame for any negative user
experience, not the libraries that use incompatible regex-based parsers.