[OpenID] XRI I18N (Was Re: Benefits of XRI i-names/i-numbers asOpenIDs)

Tan, William William.Tan at neustar.biz
Thu Feb 15 18:30:36 UTC 2007


Would help if I included the link:

http://www.unicode.org/reports/tr39/data/confusables.txt

=wil

Tan, William wrote:
> This is interesting.
>
> Part of the challenge, though, is scaling it beyond A-Z. The sheer 
> size of the Unicode repertoire makes it impossible to have a 
> definitive database of which characters are considered "similar". 
> Added to the complexity is that similarity is dependent on fonts, 
> display size and user perception. This database () maintained by 
> Unicode provides a starting point but may not work in every context.
>
> =wil
>
> James A. Donald wrote:
>>     --
>> Tan, William wrote:
>> > That is an interesting point, and nicely explains why
>> > we have not opened up registration for cyrillic. We
>> > have yet to see a mature, well-publicized
>> > implementation of Cyrillic from any of the Cyrillic
>> > communities / countries. The Polish registry is the
>> > only one that publishes a Russian table in IANA
>> > http://www.iana.org/assignments/idn/pl-serbian.html
>> > and I'm not convinced they have ironed out all issues
>> > (e.g. your example of U+0442 and U+043C are both
>> > permitted with no special treatment.)
>>
>> Many people need an algorithm for detecting similar
>> names.  Such code should become a standard to be applied
>> in a wide variety of contexts.  I have been hoping for
>> someone else to write the necessary code:
>>
>> We need source code for measuring the distance between
>> two names and a definition of distance between names.
>>
>> Here is an outline of such an algorithm.
>>
>> First canonicalize the name, mapping similar characters
>> to a standard character.  Capital o and the number zero
>> both map to 0, lower case l and the number one both map
>> to 1, cyrillic letters map to latin equivalents, and so
>> on and so forth.
>>
>> Then find all possible substrings in the canonicalized
>> name.  Maintain a database of all substrings of all
>> canonicalized registered names.  This is potentially a
>> quite large database.
>>
>> Find all substring collisions (there will be a lot) From
>> these collisions, find suspected name collisions (names
>> with long substring collisions and multiple substring
>> collisions.
>>
>> For each name suspected of colliding, construct a
>> character by character diff between the proposed name
>> its non canonical form, its localized form, and the name
>> it is suspected of colliding with in its non canonical
>> form, its localized form. By diff, I mean the equivalent
>> of the output of the source code control diff, except
>> character by character by character, instead of line by
>> line.
>>
>> Attribute an entropy for each substitution in the diff,
>> the entropy being very low when the letters are similar,
>> and about ln 26 when the letters are quite different.
>> Moves, inserts, and deletes also get about ln 26
>> entropy.  Inserts and deletes at the end are attributed
>> a low entropy.
>>
>> The entropy of the diff is a measure of the distance
>> between the names.  Forbid names that are excessively
>> similar - if the entropy of the diff between the
>> proposed name and an existing name is too low, then the
>> name is too similar.
>>
>> The big part of the task is the detection of suspects in
>> a large database of names.  We do a query that contains
>> a record for every substring that is common between the
>> new name and existing names, sorted in order of the non
>> canonical existing name.  The number of records for an
>> existing name, relative to the length of the two names,
>> is rough measure of similarity.  Those with a high
>> number of records are then measured more precisely for
>> similarity by doing the diff.
>>
>> The work required to perform the algorithm scales as the
>> number of names and the square of the name length, which
>> can get quite large.  Note, however that first step of
>> the algorithm is in the canonical form for Google's
>> system for the massively parallel execution of disk
>> based algorithms on relational databases, and the second
>> step is trivially parallelizable.
>>
>>     --digsig
>>          James A. Donald
>>      6YeGpsZR+nOTh/cGwvITnSR3TdzclVpR0+pr3YYQdkG
>>      WZzQfVO6qZx0dVjL0m6GFb3mDp8e8wLsY4RkzNc6
>>      4Dg+OwVJanSLYlaF9uMmPKcH/o/dTU6pH0Z04i0GU
>
>




More information about the general mailing list