[OpenID] XRI I18N (Was Re: Benefits of XRI i-names/i-numbers asOpenIDs)

James A. Donald jamesd at echeque.com
Thu Feb 15 00:03:18 UTC 2007


     --
Tan, William wrote:
 > That is an interesting point, and nicely explains why
 > we have not opened up registration for cyrillic. We
 > have yet to see a mature, well-publicized
 > implementation of Cyrillic from any of the Cyrillic
 > communities / countries. The Polish registry is the
 > only one that publishes a Russian table in IANA
 > http://www.iana.org/assignments/idn/pl-serbian.html
 > and I'm not convinced they have ironed out all issues
 > (e.g. your example of U+0442 and U+043C are both
 > permitted with no special treatment.)

Many people need an algorithm for detecting similar
names.  Such code should become a standard to be applied
in a wide variety of contexts.  I have been hoping for
someone else to write the necessary code:

We need source code for measuring the distance between
two names and a definition of distance between names.

Here is an outline of such an algorithm.

First canonicalize the name, mapping similar characters
to a standard character.  Capital o and the number zero
both map to 0, lower case l and the number one both map
to 1, cyrillic letters map to latin equivalents, and so
on and so forth.

Then find all possible substrings in the canonicalized
name.  Maintain a database of all substrings of all
canonicalized registered names.  This is potentially a
quite large database.

Find all substring collisions (there will be a lot) From
these collisions, find suspected name collisions (names
with long substring collisions and multiple substring
collisions.

For each name suspected of colliding, construct a
character by character diff between the proposed name
its non canonical form, its localized form, and the name
it is suspected of colliding with in its non canonical
form, its localized form. By diff, I mean the equivalent
of the output of the source code control diff, except
character by character by character, instead of line by
line.

Attribute an entropy for each substitution in the diff,
the entropy being very low when the letters are similar,
and about ln 26 when the letters are quite different.
Moves, inserts, and deletes also get about ln 26
entropy.  Inserts and deletes at the end are attributed
a low entropy.

The entropy of the diff is a measure of the distance
between the names.  Forbid names that are excessively
similar - if the entropy of the diff between the
proposed name and an existing name is too low, then the
name is too similar.

The big part of the task is the detection of suspects in
a large database of names.  We do a query that contains
a record for every substring that is common between the
new name and existing names, sorted in order of the non
canonical existing name.  The number of records for an
existing name, relative to the length of the two names,
is rough measure of similarity.  Those with a high
number of records are then measured more precisely for
similarity by doing the diff.

The work required to perform the algorithm scales as the
number of names and the square of the name length, which
can get quite large.  Note, however that first step of
the algorithm is in the canonical form for Google's
system for the massively parallel execution of disk
based algorithms on relational databases, and the second
step is trivially parallelizable.

     --digsig
          James A. Donald
      6YeGpsZR+nOTh/cGwvITnSR3TdzclVpR0+pr3YYQdkG
      WZzQfVO6qZx0dVjL0m6GFb3mDp8e8wLsY4RkzNc6
      4Dg+OwVJanSLYlaF9uMmPKcH/o/dTU6pH0Z04i0GU



More information about the general mailing list