[OpenID] XRI I18N (Was Re: Benefits of XRI i-names/i-numbers asOpenIDs)

Tan, William William.Tan at neustar.biz
Thu Feb 15 18:28:17 UTC 2007


This is interesting.

Part of the challenge, though, is scaling it beyond A-Z. The sheer size 
of the Unicode repertoire makes it impossible to have a definitive 
database of which characters are considered "similar". Added to the 
complexity is that similarity is dependent on fonts, display size and 
user perception. This database () maintained by Unicode provides a 
starting point but may not work in every context.

=wil

James A. Donald wrote:
>     --
> Tan, William wrote:
> > That is an interesting point, and nicely explains why
> > we have not opened up registration for cyrillic. We
> > have yet to see a mature, well-publicized
> > implementation of Cyrillic from any of the Cyrillic
> > communities / countries. The Polish registry is the
> > only one that publishes a Russian table in IANA
> > http://www.iana.org/assignments/idn/pl-serbian.html
> > and I'm not convinced they have ironed out all issues
> > (e.g. your example of U+0442 and U+043C are both
> > permitted with no special treatment.)
>
> Many people need an algorithm for detecting similar
> names.  Such code should become a standard to be applied
> in a wide variety of contexts.  I have been hoping for
> someone else to write the necessary code:
>
> We need source code for measuring the distance between
> two names and a definition of distance between names.
>
> Here is an outline of such an algorithm.
>
> First canonicalize the name, mapping similar characters
> to a standard character.  Capital o and the number zero
> both map to 0, lower case l and the number one both map
> to 1, cyrillic letters map to latin equivalents, and so
> on and so forth.
>
> Then find all possible substrings in the canonicalized
> name.  Maintain a database of all substrings of all
> canonicalized registered names.  This is potentially a
> quite large database.
>
> Find all substring collisions (there will be a lot) From
> these collisions, find suspected name collisions (names
> with long substring collisions and multiple substring
> collisions.
>
> For each name suspected of colliding, construct a
> character by character diff between the proposed name
> its non canonical form, its localized form, and the name
> it is suspected of colliding with in its non canonical
> form, its localized form. By diff, I mean the equivalent
> of the output of the source code control diff, except
> character by character by character, instead of line by
> line.
>
> Attribute an entropy for each substitution in the diff,
> the entropy being very low when the letters are similar,
> and about ln 26 when the letters are quite different.
> Moves, inserts, and deletes also get about ln 26
> entropy.  Inserts and deletes at the end are attributed
> a low entropy.
>
> The entropy of the diff is a measure of the distance
> between the names.  Forbid names that are excessively
> similar - if the entropy of the diff between the
> proposed name and an existing name is too low, then the
> name is too similar.
>
> The big part of the task is the detection of suspects in
> a large database of names.  We do a query that contains
> a record for every substring that is common between the
> new name and existing names, sorted in order of the non
> canonical existing name.  The number of records for an
> existing name, relative to the length of the two names,
> is rough measure of similarity.  Those with a high
> number of records are then measured more precisely for
> similarity by doing the diff.
>
> The work required to perform the algorithm scales as the
> number of names and the square of the name length, which
> can get quite large.  Note, however that first step of
> the algorithm is in the canonical form for Google's
> system for the massively parallel execution of disk
> based algorithms on relational databases, and the second
> step is trivially parallelizable.
>
>     --digsig
>          James A. Donald
>      6YeGpsZR+nOTh/cGwvITnSR3TdzclVpR0+pr3YYQdkG
>      WZzQfVO6qZx0dVjL0m6GFb3mDp8e8wLsY4RkzNc6
>      4Dg+OwVJanSLYlaF9uMmPKcH/o/dTU6pH0Z04i0GU




More information about the general mailing list