[OpenID] Canonical OpenID url form

Peter Williams pwilliams at rapattoni.com
Thu Jul 10 19:07:05 UTC 2008


I was thinking like the lazy programmer I am: use XRI libraries to address all advanced culture/language/encoding issues. Then, as you say, prefix that with a constant http://<int-domain>/. Then, per IRI conventions, rewrite that so, as in Arabic, one gets right to left URL visuals (<Arabic-script>//:ptth) to suit the population that is not particularly enamored with Roman culture.

-----Original Message-----
From: Drummond Reed [mailto:drummond.reed at cordance.net]
Sent: Thursday, July 10, 2008 11:31 AM
To: Peter Williams; 'Johnny Bufu'; 'Andrew Arnott'
Cc: 'OpenID List'
Subject: RE: [OpenID] Canonical OpenID url form

Martin's right, Peter -- XRI is one option for Unicode. But you can also use
an internationalized domain name
(http://en.wikipedia.org/wiki/Internationalized_domain_name) in a regular
URL. It uses Punycode (http://en.wikipedia.org/wiki/Punycode).
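
As a rough illustration of the IDN route, here is a minimal Python sketch. The host name `bücher.example` is a made-up example, and Python's built-in `idna` codec implements the older IDNA 2003 rules, so treat this as a sketch under those assumptions rather than a complete IDN implementation.

```python
# Hypothetical internationalized host name, used only for illustration.
unicode_host = "b\u00fccher.example"  # "bücher.example"

# The built-in "idna" codec converts each label to its ASCII
# (Punycode, "xn--" prefixed) form.
ascii_host = unicode_host.encode("idna").decode("ascii")

# The ASCII form is what goes into a regular http:// URL.
url = "http://" + ascii_host + "/"
print(url)

# Decoding with the same codec recovers the Unicode form for display.
round_trip = ascii_host.encode("ascii").decode("idna")
print(round_trip)
```

Only the host name is covered here; non-ASCII characters in the path or query still need percent-encoding.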

You can also turn an XRI into an URL by adding an XRI proxy resolver prefix
(such as http://xri.net/ -- see my sig below for an example). In that
approach the proxy resolver prefix has nothing to do with the XRI itself, so
there's no need to internationalize the domain name.
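
The prefixing step can be sketched in a few lines of Python. The helper name `to_hxri` is hypothetical, and the sketch assumes the XRI is already in a form that is safe to place in a URL; it does not perform the XRI-to-IRI-to-URI escaping defined by the XRI specifications.

```python
# Proxy resolver prefix mentioned in the message; any other proxy
# resolver could be substituted.
XRI_PROXY_PREFIX = "http://xri.net/"


def to_hxri(xri: str, prefix: str = XRI_PROXY_PREFIX) -> str:
    """Hypothetical helper: build an HTTP URL by prefixing an XRI
    with a proxy resolver address."""
    # The prefix is plain ASCII and independent of the XRI, so the
    # domain name itself never needs internationalizing.
    assert prefix.isascii()
    return prefix + xri


print(to_hxri("=drummond.reed"))
```

With the XRI from the signature, this yields the same URL that appears there.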

=Drummond
http://xri.net/=drummond.reed


> -----Original Message-----
> From: Peter Williams [mailto:pwilliams at rapattoni.com]
> Sent: Wednesday, July 09, 2008 11:40 PM
> To: Drummond Reed; 'Johnny Bufu'; 'Andrew Arnott'
> Cc: 'OpenID List'
> Subject: RE: [OpenID] Canonical OpenID url form
>
> So the short form of the story is: use xri for unicode (and then transform
> the xri into an https hxri).
>
> It's been a month since I studied xri (and thus have forgotten 80 percent
> of it). I recall there was a syntax to identify the address of the initial
> resolver. Is there a way that this became the domain name component of the
> hxri?
>
> -----Original Message-----
> From: Drummond Reed <drummond.reed at cordance.net>
> Sent: Wednesday, July 09, 2008 11:34 PM
> To: 'Johnny Bufu' <johnny.bufu at gmail.com>; 'Andrew Arnott'
> <andrewarnott at gmail.com>
> Cc: 'OpenID List' <general at openid.net>
> Subject: Re: [OpenID] Canonical OpenID url form
>
>
> Also for the record, XRIs (which use the IRI character set) have a very
> simple defined transformation into IRIs. Thus when an XRI needs to be sent
> over-the-wire in an HTTP(S) URI, it must first be transformed into an IRI,
> then you follow the IRI spec (RFC 3987) to transform into a URI as Johnny
> describes below. Reverse the process to display back to the user.
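
The IRI-to-URI step can be sketched roughly as follows. This is a simplified reading of the RFC 3987 mapping, not a full implementation: the function name `iri_to_uri` and the sample IRI are made up, user-info is ignored, the host is converted with Python's IDNA 2003 codec, and the XRI-to-IRI step that precedes it is not shown.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched: URI reserved characters plus "%", so that
# escapes already present in the input are not encoded a second time.
_SAFE = "/:@!$&'()*+,;=~%-._"


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: map an absolute http(s) IRI to an ASCII URI."""
    parts = urlsplit(iri)
    # Host: convert each label to its Punycode ("xn--") form.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Everything else: encode non-ASCII characters as UTF-8, then
    # percent-encode each resulting byte.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE + "?")
    fragment = quote(parts.fragment, safe=_SAFE + "?")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://b\u00fccher.example/\u00c1?q=\u00e9"))
```

Going the other way for display means decoding the percent-escapes as UTF-8 and the host from Punycode, taking care not to decode escapes that stand for reserved characters.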
>
> See
> http://docs.oasis-open.org/xri/xri-syntax/2.0/specs/cs01/xri-syntax-V2.0-cs.html
> for all the gory details (and they are gory - Unicode is hard).
>
> =Drummond
>
> > -----Original Message-----
> > From: general-bounces at openid.net [mailto:general-bounces at openid.net] On
> > Behalf Of Johnny Bufu
> > Sent: Wednesday, July 09, 2008 10:52 PM
> > To: Andrew Arnott
> > Cc: OpenID List
> > Subject: Re: [OpenID] Canonical OpenID url form
> >
> > For the record, since this continued in an offline thread:
> >
> > The issue is around the User-Supplied Identifiers. OpenID defines them
> > as a type of Identifier, which in turn is defined as an HTTP(S) URI or
> > an XRI. HTTP(S) URIs do not allow non-ASCII characters.
> >
> > So, out of scope of OpenID, parties accepting IRIs (other than XRIs)
> > should follow the respective authoritative recommendations (i.e.
> > RFC3987) before presenting such strings to the OpenID layer as HTTP
> > URIs, and convert them back to IRI form later on when they need to be
> > displayed back to the users.
> >
> > Johnny
> >
> > On 08/07/08 10:32 PM, Andrew Arnott wrote:
> > > Thanks, Johnny.  I've had some conversations with a few other people
> > > who draw the opposite conclusion and believe that the %AB%CD notation
> > > is the canonical form.
> > >
> > > You make a good point about having to unescape the characters from
> > > the URI just above the transport layer, but I believe you're applying
> > >  section 4.1 to the URL when it should only be applied to the
> > > key/value pairs.  The OpenID ClaimedIdentifier, which by the spec is
> > > the last URL to respond without an HTTP redirect, cannot be in
> > > unicode by the URI specification because unicode characters are not
> > > allowed, whether that is UTF8 or UTF16.
> > >
> > > Name/value pairs passed as part of a querystring may (and as the
> > > section you quote requires) be encoded as UTF-8, but they are
> > > subsequently URI encoded as %AB%CD hex characters (thus doubly
> > > encoded) so they are actually no longer UTF-8 at the transport layer.
> > > > The OpenID URL, around which all the identity of OpenID is focused
> > > > (omitting XRIs, which don't suffer from this problem), /is/ at the
> > > > transport layer, because of the way the security requirements force
> > > > the claimed identifier to be discovered. Since it is all about the
> > > > transport layer, I believe it would be a mistake to add semantics on
> > > > top of that and call it canonical.
> > >
> > > What I also realized from some other conversations is that this
> > > doesn't really matter.  As long as an OP or RP is consistent within
> > > itself in storing and comparing Claimed Identifiers, whether it
> > > stores and compares %AB%CD or the unicode equivalent character won't
> > > matter to anyone, since on the protocol/wire level it is always
> > > %AB%CD.  However, I think unescaping the URL and getting the original
> > >  unicode characters back is very useful and should be done for
> > > purposes of displaying to the user.
> > >
> > > > I think for the security and guaranteed identity of the protocol,
> > > > there is a meaningful side to this though.  It has nothing to do with
> > > > how the claimed identifier is stored, but rather with how a unicode
> > > > string is escaped for URI transport.  A given unicode string may be
> > > > represented by more than one series of bytes.  Some characters can be
> > > > written as more than one sequence of code points, and so have multiple
> > > > byte sequences in UTF-8 or UTF-16 /for the same character/. Therefore
> > > > someone who types in their OpenID url to a site using one method
> > > > during one visit, and then types it in to the same site using a
> > > > different method on a subsequent visit, will only be identified by the
> > > > RP as the same visitor if OpenID requires that the RP transform
> > > > whatever unicode string is given by the user to a canonical form as
> > > > defined by the unicode standard before transit.  For example, the
> > > > letter 'Á' can be encoded as a single character or using composition
> > > > by adding an accent to the A character.  Both are legal, and the
> > > > unicode standard treats them as canonically equivalent; its
> > > > normalization forms (such as NFC) each pick one representation.  But
> > > > if a string containing this character is not canonicalized first, then
> > > > although the character is equivalent to the user and to unicode, the
> > > > encoded %AB%CD string will be different, resulting in security
> > > > problems for OpenID because people could overload a single Identifier
> > > > just by using different encodings at an OP, or fail to log into an RP
> > > > depending on how they craft their string. By the way, I say 'unicode'
> > > > in the strict sense, applying to UTF-8, UTF-16, etc.  Unicode is
> > > > commonly used to refer to just UTF-16, but this problem applies to all
> > > > unicode encodings.
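
The 'Á' example can be reproduced with a short Python sketch. The helper name `escape_normalized` is hypothetical, and the choice of NFC as the normalization form is an assumption made for illustration; the OpenID specification discussed in this thread does not mandate one.

```python
import unicodedata
from urllib.parse import quote

precomposed = "\u00c1"   # 'Á' as a single code point
decomposed = "A\u0301"   # 'A' followed by COMBINING ACUTE ACCENT

# The two strings look the same to the user, but they percent-encode
# to different byte sequences.
print(quote(precomposed))  # %C3%81
print(quote(decomposed))   # A%CC%81


def escape_normalized(text: str) -> str:
    """Hypothetical helper: normalize to NFC (an assumed choice),
    then percent-encode the UTF-8 bytes."""
    return quote(unicodedata.normalize("NFC", text))


# After normalization both inputs yield the same escaped string.
print(escape_normalized(precomposed) == escape_normalized(decomposed))  # True
```

Whichever form is chosen, the point is that the same normalization has to be applied before every escape and every comparison.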
> > >
> > >
> > >
> > >
> > > So I think OpenID should be more explicit about its unicode support
> > > for Identifiers, including mandating a canonical Unicode form.
> > >
> > > On Tue, Jul 8, 2008 at 9:41 PM, Johnny Bufu <johnny.bufu at gmail.com
> > > <mailto:johnny.bufu at gmail.com>> wrote:
> > >
> > >
> > > On 08/07/08 03:01 PM, Andrew Arnott wrote:
> > >
> > > What is the canonical form of an OpenID URL? One with the %AB%CD hex
> > > encoding for unicode chars in the URL or with the actual unicode
> > > chars? For the purposes of displaying to the user and storing in the
> > > RP's database.
> > >
> > > The spec doesn't seem to have anything to say on this.
> > >
> > >
> > > I believe it does say:
> > >
> > > > 4.1.  Protocol Messages
> > > >
> > > > The OpenID Authentication protocol messages are mappings of
> > > > plain-text keys to plain-text values. The keys and values permit the
> > > > full Unicode character set (UCS). When the keys and values need to be
> > > > converted to/from bytes, they MUST be encoded using UTF-8 [RFC3629].
> > >
> > > http://openid.net/specs/openid-authentication-2_0.html#anchor4
> > >
> > >
> > > The reason I think it's not a simple automatic answer is the unicode
> > > chars may be what the user typed in and what exists on the server,
> > > but in transit, these characters are translated to %AB%CD in order to
> > >  be validly escaped URI strings.
> > >
> > >
> > > The receiving party must decode them to the original form when they
> > > are extracted from the transport layer.
> > >
> > >
> > > So one could argue that the unicode characters are never part of the
> > > protocol
> > >
> > >
> > > One would then be ignoring the parts of the protocol that do not deal
> > >  with the transport layer directly.
> > >
> > >
> > > Johnny
> > >
> > >
> > _______________________________________________
> > general mailing list
> > general at openid.net
> > http://openid.net/mailman/listinfo/general
>



