Thanks, Johnny. I've had some conversations with a few other people who draw the opposite conclusion and believe that the %AB%CD notation is the canonical form.<br><br>You make a good point about having to unescape the characters from the URI just above the transport layer, but
I believe you're applying section 4.1 to the URL when it should only be applied to the key/value pairs. The OpenID Claimed Identifier, which by the spec is the last URL to respond without an HTTP redirect, cannot contain raw Unicode characters, because the URI specification does not allow them, whether they are encoded as UTF-8 or UTF-16. <br>
<br>Name/value pairs passed as part of a querystring may (and as the section you quote requires) be encoded as UTF-8, but they are subsequently URI encoded as %AB%CD hex characters (thus doubly encoded), so they are actually no longer UTF-8 at the transport layer. The OpenID URL, around which all identity in OpenID is focused (omitting XRIs, which don't suffer from this problem), <i>is</i> at the transport layer, because of the way the security requirements force the claimed identifier to be discovered. So I believe it would be a mistake to add semantics on top of that and call it canonical. <br>
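<br>To make the double encoding concrete, here is a minimal Python sketch. The path is made up for illustration; <code>quote</code> is the percent-encoder from the standard library's <code>urllib.parse</code>. It shows the character first becoming UTF-8 bytes and then each non-ASCII byte becoming a %XX escape:<br>

```python
from urllib.parse import quote

# Hypothetical identifier path containing U+00C1 (capital A with acute).
path = "/users/\u00c1lvaro"

# Step 1: the text is encoded as UTF-8 bytes.
utf8_bytes = path.encode("utf-8")

# Step 2: each non-ASCII byte is percent-escaped for the wire.
wire_path = quote(path, safe="/")

print(utf8_bytes)  # b'/users/\xc3\x81lvaro'
print(wire_path)   # /users/%C3%81lvaro
```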
<br>What I also realized from some other conversations is that this doesn't really matter. As long as an OP or RP is consistent within itself in storing and comparing Claimed Identifiers, whether it stores and compares %AB%CD or the equivalent Unicode character won't matter to anyone, since at the protocol/wire level it is always %AB%CD. However, I think unescaping the URL to get the original Unicode characters back is very useful and should be done for the purpose of displaying it to the user.<br>
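<br>As a rough sketch of that split between what is stored and what is displayed (the URL and the helper name <code>display_form</code> are hypothetical, and this is only one way an RP might do it), the wire form is kept for storage and comparison and unescaped only for the user's eyes:<br>

```python
from urllib.parse import unquote

def display_form(wire_url):
    """Decode %XX escapes as UTF-8, for display to the user only."""
    return unquote(wire_url, encoding="utf-8", errors="replace")

# Hypothetical claimed identifier as it appears on the wire.
wire = "http://example.com/users/%C3%81lvaro"

stored = wire               # store and compare one form consistently
shown = display_form(wire)  # human-readable form, never used for comparison

print(shown)  # http://example.com/users/Álvaro
```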
<br>I think there is a meaningful side to this for the security and guaranteed identity of the protocol, though. It has nothing to do with how the claimed identifier is stored, but rather with how a Unicode string is escaped for URI transport. A given Unicode string may be represented by more than one sequence of code points, and therefore by more than one series of bytes in UTF-8 or UTF-16, <i>for the same character</i>. Therefore someone who types their OpenID URL into a site using one method during one visit, and then types it into the same site using a different method on a subsequent visit, will only be identified by the RP as the same visitor if OpenID requires the RP to transform whatever Unicode string the user gives into a canonical form, as defined by the Unicode standard, before transit. For example, the letter 'Á' can be encoded as a single precomposed character or by composition, adding a combining accent to the A character. Both are legal and canonically equivalent, and the Unicode standard defines normalization forms (NFC and NFD) that each pick one representation. But if a string containing this character is not normalized first, then although the character is equivalent to the user and to Unicode, the encoded %AB%CD string will be different. That results in security problems for OpenID, because people could overload a single Identifier just by using different encodings at an OP, or fail to log into an RP depending on how they craft their string. By the way, I say 'Unicode' in the strict sense, covering UTF-8, UTF-16, etc. 'Unicode' is sometimes used to refer to just UTF-16, but this problem applies to all Unicode encodings.<br>
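<br>Here is a small Python sketch of the problem and of one possible fix. The helper name <code>escape_identifier</code> and the choice of NFC are my assumptions, not anything the OpenID spec mandates; <code>unicodedata.normalize</code> and <code>urllib.parse.quote</code> are standard-library calls:<br>

```python
import unicodedata
from urllib.parse import quote

precomposed = "\u00c1"  # 'Á' as one code point
decomposed = "A\u0301"  # 'A' followed by a combining acute accent

def escape_identifier(text, normalize=True):
    """Percent-encode text, optionally normalizing to NFC first (assumed choice)."""
    if normalize:
        text = unicodedata.normalize("NFC", text)
    return quote(text, safe="/:")

# Without normalization the same visible letter yields two wire strings.
print(escape_identifier(precomposed, normalize=False))  # %C3%81
print(escape_identifier(decomposed, normalize=False))   # A%CC%81

# With normalization both inputs collapse to a single wire string.
print(escape_identifier(precomposed))  # %C3%81
print(escape_identifier(decomposed))   # %C3%81
```

Whether NFC or another form is used matters less than everyone using the same one before escaping.<br>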
<br>So I think OpenID should be more explicit about its Unicode support for Identifiers, including mandating a canonical Unicode form. <br><br><div class="gmail_quote">On Tue, Jul 8, 2008 at 9:41 PM, Johnny Bufu &lt;<a href="mailto:johnny.bufu@gmail.com">johnny.bufu@gmail.com</a>&gt; wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="Ih2E3d"><br>
On 08/07/08 03:01 PM, Andrew Arnott wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
What is the canonical form of an OpenID URL? One with the %AB%CD hex encoding for unicode chars in the URL or with the actual unicode chars? For the purposes of displaying to the user and storing in the RP's database.<br>
<br>
The spec doesn't seem to have anything to say on this. <br>
</blockquote>
<br></div>
I believe it does say:<br>
<br>
4.1. Protocol Messages<br>
The OpenID Authentication protocol messages are mappings of plain-text keys to plain-text values. The keys and values permit the full Unicode character set (UCS). When the keys and values need to be converted to/from bytes, they MUST be encoded using UTF-8 [RFC3629].<br>
<br>
<a href="http://openid.net/specs/openid-authentication-2_0.html#anchor4" target="_blank">http://openid.net/specs/openid-authentication-2_0.html#anchor4</a><div class="Ih2E3d"><br>
<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
The reason I think it's not a simple automatic answer is the unicode chars may be what the user typed in and what exists on the server, but in transit, these characters are translated to %AB%CD in order to be validly escaped URI strings. <br>
</blockquote>
<br></div>
The receiving party must decode them to the original form when they are extracted from the transport layer.<div class="Ih2E3d"><br>
<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
So one could argue that the unicode characters are never part of the protocol <br>
</blockquote>
<br></div>
One would then be ignoring the parts of the protocol that do not deal with the transport layer directly.<br><font color="#888888">
<br>
<br>
Johnny<br>
<br>
</font></blockquote></div><br>