[OpenID] Canonical OpenID url form

Andrew Arnott andrewarnott at gmail.com
Wed Jul 9 05:32:19 UTC 2008


Thanks, Johnny.  I've had some conversations with a few other people who
draw the opposite conclusion and believe that the %AB%CD notation is the
canonical form.

You make a good point about having to unescape the characters from the URI
just above the transport layer, but I believe you're applying section 4.1 to
the URL when it should only be applied to the key/value pairs.  The OpenID
Claimed Identifier, which by the spec is the last URL to respond without an
HTTP redirect, cannot contain raw Unicode characters, because the URI
specification does not allow them, whether they are encoded as UTF-8 or
UTF-16.

Name/value pairs passed as part of a querystring may (and as the section you
quote requires) be encoded as UTF-8, but they are subsequently URI encoded
as %AB%CD hex characters (thus doubly encoded) so they are actually no
longer raw UTF-8 at the transport layer.  The OpenID URL, around which all
the identity of OpenID is focused (omitting XRIs, which don't suffer from
this problem), *is* at the transport layer, because of the way the security
requirements force the claimed identifier to be discovered.  Since the
identifier is all about the transport layer, I believe it would be a mistake
to add semantics on top of that and call it canonical.
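
To make the "doubly encoded" point concrete, here is a minimal Python sketch
of the two steps: text to UTF-8 bytes, then bytes to %XX escapes.  The helper
name to_wire_form is my own invention for illustration, not anything from the
spec or from an OpenID library.

```python
from urllib.parse import quote

def to_wire_form(text: str) -> str:
    """Encode text as UTF-8, then percent-escape every byte that is not
    an unreserved URI character (hypothetical helper, for illustration)."""
    return quote(text, safe="", encoding="utf-8")

# Step 1: the character becomes two UTF-8 bytes, 0xC3 0x81.
utf8_bytes = "Á".encode("utf-8")

# Step 2: each of those bytes becomes a three-character ASCII escape,
# so what travels in the URI is plain ASCII, not raw UTF-8.
wire = to_wire_form("Á")
print(utf8_bytes, wire)
```

What goes over the wire is six ASCII characters, which is the sense in which
the value is no longer raw UTF-8 at the transport layer.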

What I also realized from some other conversations is that this doesn't
really matter.  As long as an OP or RP is consistent within itself in
storing and comparing Claimed Identifiers, whether it stores and compares
%AB%CD or the unicode equivalent character won't matter to anyone, since on
the protocol/wire level it is always %AB%CD.  However, I think unescaping
the URL and getting the original unicode characters back is very useful and
should be done for purposes of displaying to the user.
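
As a rough sketch of what I mean (Python, with a made-up identifier and
helper names that are only assumptions for this example), an RP could store
and compare the escaped form and unescape it only for display:

```python
from urllib.parse import quote, unquote

def display_form(stored_identifier: str) -> str:
    """Decode %XX escapes as UTF-8, for showing to the user only."""
    return unquote(stored_identifier)

def wire_form(display_identifier: str) -> str:
    """Re-escape non-ASCII characters; ':' and '/' are left alone because
    they act as delimiters in this simple example URL."""
    return quote(display_identifier, safe=":/")

# Hypothetical Claimed Identifier, stored in its escaped (wire) form.
stored = "http://example.com/%C3%81lvaro"

shown = display_form(stored)      # what the user sees
round_trip = wire_form(shown)     # back to the escaped form
print(shown, round_trip == stored)
```

The round trip works for this simple URL, but unescaping and re-escaping is
not lossless in general (an escaped delimiter such as %2F would not survive
it), which is one more reason to pick one form for storage and stick with it.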

I think there is a side to this that does matter for the security and
guaranteed identity of the protocol, though.  It has nothing to do with how
the claimed identifier is stored, but rather with how a Unicode string is
escaped for URI transport.  A given string, as the user sees it, may be
represented by more than one series of bytes: some characters can be written
as more than one sequence of Unicode code points, and each sequence has its
own byte sequence in UTF-8 or UTF-16.  Therefore someone who types their
OpenID URL into a site using one method during one visit, and then types it
into the same site using a different method on a subsequent visit, will only
be identified by the RP as the same visitor if OpenID requires that the RP
transform whatever Unicode string the user gives into a canonical form, as
defined by the Unicode standard, before transit.  For example, the letter
'Á' can be encoded as a single precomposed character (U+00C1) or by
composition, adding a combining accent (U+0301) to the letter A.  Both are
legal and canonically equivalent, and the Unicode standard defines
normalization forms (NFC and NFD) that reduce both to a single
representation.  If a string containing this character is not normalized
first, then although the two forms are equivalent to the user and to
Unicode, the encoded %AB%CD strings will differ.  That results in security
problems for OpenID: people could overload what appears to be a single
Identifier at an OP just by using different encodings, or fail to log into
an RP depending on how they craft their string.  By the way, I say 'Unicode'
in the broad sense, covering UTF-8, UTF-16, etc.  'Unicode' is sometimes
used to refer to just UTF-16, but this problem applies to every Unicode
encoding.
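
Here is a small Python sketch of the problem.  I use NFC as the normalization
form purely as an assumption for the example; the OpenID spec does not name
one, and the helper names are mine.

```python
import unicodedata
from urllib.parse import quote

def escape_raw(text: str) -> str:
    """Percent-escape the UTF-8 bytes of text exactly as given."""
    return quote(text, safe="")

def escape_normalized(text: str) -> str:
    """Normalize to NFC first (an assumed choice), then percent-escape."""
    return quote(unicodedata.normalize("NFC", text), safe="")

precomposed = "\u00c1"    # 'Á' as a single code point
decomposed = "A\u0301"    # 'A' followed by a combining acute accent

# Without normalization the two spellings escape differently.
print(escape_raw(precomposed), escape_raw(decomposed))

# With normalization both spellings escape to the same string.
print(escape_normalized(precomposed), escape_normalized(decomposed))
```

Without the normalization step the two spellings give "%C3%81" and
"A%CC%81", so an RP comparing escaped strings would treat them as two
different identifiers.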

So I think OpenID should be more explicit about its unicode support for
Identifiers, including mandating a canonical Unicode form.

On Tue, Jul 8, 2008 at 9:41 PM, Johnny Bufu <johnny.bufu at gmail.com> wrote:

>
> On 08/07/08 03:01 PM, Andrew Arnott wrote:
>
>> What is the canonical form of an OpenID URL? One with the %AB%CD hex
>> encoding for unicode chars in the URL or with the actual unicode chars? For
>> the purposes of displaying to the user and storing in the RP's database.
>>
>> The spec doesn't seem to have anything to say on this.
>>
>
> I believe it does say:
>
> 4.1.  Protocol Messages
> The OpenID Authentication protocol messages are mappings of plain-text keys
> to plain-text values. The keys and values permit the full Unicode character
> set (UCS). When the keys and values need to be converted to/from bytes, they
> MUST be encoded using UTF-8 [RFC3629].
>
> http://openid.net/specs/openid-authentication-2_0.html#anchor4
>
>> The reason I think it's not a simple automatic answer is the unicode chars
>> may be what the user typed in and what exists on the server, but in transit,
>> these characters are translated to %AB%CD in order to be validly escaped URI
>> strings.
>>
>
> The receiving party must decode them to the original form when they are
> extracted from the transport layer.
>
>> So one could argue that the unicode characters are never part of the
>> protocol
>>
>
> One would then be ignoring the parts of the protocol that do not deal with
> the transport layer directly.
>
>
> Johnny
>
>