Jump to content

Damerau–Levenshtein distance: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
Fixed the code, so that string indices are correct inside the loop.
Srchvrs (talk | contribs)
1) Fixed previous incorrect edition of string indexes, reverted to the original version 2) Fixed links to ACM papers 3) Fixed a link to the "fuzzy searching" site (it is a more proper link now)
Line 18: Line 18:
''str1'' of length ''lenStr1'', and ''str2'' of length ''lenStr2'', and computes the
''str1'' of length ''lenStr1'', and ''str2'' of length ''lenStr2'', and computes the
Damerau-Levenshtein distance between them:
Damerau-Levenshtein distance between them:

'''int''' DamerauLevenshteinDistance('''char''' str1[1..lenStr1], '''char''' str2[1..lenStr2])
'''int''' DamerauLevenshteinDistance('''char''' str1[1..lenStr1], '''char''' str2[1..lenStr2])
''// d is a table with lenStr1+1 rows and lenStr2+1 columns''
''// d is a table with lenStr1+1 rows and lenStr2+1 columns''
Line 32: Line 31:
'''for''' i '''from''' 1 '''to''' lenStr1
'''for''' i '''from''' 1 '''to''' lenStr1
'''for''' j '''from''' 1 '''to''' lenStr2
'''for''' j '''from''' 1 '''to''' lenStr2
'''if''' str1[i-1] = str2[j-1] '''then''' cost := 0
'''if''' str1[i] = str2[j] '''then''' cost := 0
'''else''' cost := 1
'''else''' cost := 1
d[i, j] := minimum(
d[i, j] := minimum(
Line 39: Line 38:
d[i-1, j-1] + cost ''// substitution''
d[i-1, j-1] + cost ''// substitution''
)
)
'''if'''(i > 1 '''and''' j > 1 '''and''' str1[i-1] = str2[j-2] and str1[i-2] = str2[j-1]) '''then'''
'''if'''(i > 1 '''and''' j > 1 '''and''' str1[i] = str2[j-1] and str1[i-1] = str2[j]) '''then'''
d[i, j] := minimum(
d[i, j] := minimum(
d[i, j],
d[i, j],
Line 50: Line 49:
Basically this is the algorithm to compute [[Levenshtein distance]] with one additional
Basically this is the algorithm to compute [[Levenshtein distance]] with one additional
recurrence:
recurrence:
'''if'''(i > 1 '''and''' j > 1 '''and''' str1[i-1] = str2[j-2] and str1[i-2] = str2[j-1]) '''then'''
'''if'''(i > 1 '''and''' j > 1 '''and''' str1[i] = str2[j-1] and str1[i-1] = str2[j]) '''then'''
d[i, j] := minimum(
d[i, j] := minimum(
d[i, j],
d[i, j],
Line 98: Line 97:
That is why this limitation is not very important. However, one must remember that
That is why this limitation is not very important. However, one must remember that
restricted edit distance does not always satisfy [[triangle inequality]] and, thus, cannot be
restricted edit distance does not always satisfy [[triangle inequality]] and, thus, cannot be
used with [[metric trees]]. An extension of the edit distance algorithm, that '''does satisfy''' a [[triangle inequality]] is described in the paper: [http://portal.acm.org/citation.cfm?id=321880 F.J. Damerau. A technique for computer detection and correction of spelling errors, Communications of the ACM, 1964]
used with [[metric trees]].




Line 109: Line 108:


== References ==
== References ==
* {{note|D64}} [http://dalcin.org/referencias/pdf/325.pdf F.J. Damerau. A technique for computer detection and correction of spelling errors, Communications of the ACM, 1964]
* {{note|D64}} [http://portal.acm.org/citation.cfm?id=363994 F.J. Damerau. A technique for computer detection and correction of spelling errors, Communications of the ACM, 1964]
* {{note|L66}} V.I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 1966.
* {{note|L66}} V.I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 1966.
* {{note|LW75}} [http://dalcin.org/referencias/pdf/697.pdf R. Lowrance, R. Wagner. An Extension of the String-to-String Correction Problem, JACM, 1975.]
* {{note|LW75}} [http://portal.acm.org/citation.cfm?id=321880 R. Lowrance, R. Wagner. An Extension of the String-to-String Correction Problem, JACM, 1975.]
* {{note|Nav01}} [http://www.egeen.ee/u/vilo/edu/2002-03/Tekstialgoritmid_I/Articles/Approximate/Navarro_Review_on_Approximate_Matching_p31-navarro.pdf G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys (CSUR) archive 33(1), pp 31-88, 2001.]
* {{note|Nav01}} [http://www.egeen.ee/u/vilo/edu/2002-03/Tekstialgoritmid_I/Articles/Approximate/Navarro_Review_on_Approximate_Matching_p31-navarro.pdf G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys (CSUR) archive 33(1), pp 31-88, 2001.]


== External links ==
== External links ==


* {{note|itman}} [http://itman.narod.ru/english Site devoted to fuzzy searching and information retrieval]
* {{note|itman}} [http://itman.narod.ru/english/ir/index.html Site devoted to fuzzy searching and information retrieval]
* Project Dedupe http://dedupe.sourceforge.net
* Project Dedupe http://dedupe.sourceforge.net



Revision as of 19:53, 18 November 2006

Damerau-Levenshtein distance is an extension of Levenshtein distance that counts transposition as a single edit operation. Strictly speaking, the Damerau-Levenshtein distance is equal to the minimal number of insertions, deletions, substitutions and transpositions needed to transform one string into the other. Damerau in his seminal paper [1] not only distinguished these four edit operations but also stated that they correspond to more than 80% of all human misspellings. It is worth noting that Damerau concentrated on single-character misspellings. The edit distance itself was introduced by Levenshtein [2] , who generalized this concept by allowing multiple edit operations, but did not include transpositions in the set of basic operations.


The algorithm

Adding transpositions sounds simple, but in reality there is a serious complication. Firstly, let us consider a direct extension of the formula used to calculate Levenshtein distance. Below is pseudocode for a function DamerauLevenshteinDistance that takes two strings, str1 of length lenStr1, and str2 of length lenStr2, and computes the Damerau-Levenshtein distance between them:

int DamerauLevenshteinDistance(char str1[1..lenStr1], char str2[1..lenStr2])
   // d is a table with lenStr1+1 rows and lenStr2+1 columns
   declare int d[0..lenStr1, 0..lenStr2]
   // i and j are used to iterate over str1 and str2
   declare int i, j, cost
 
   for i from 0 to lenStr1
       d[i, 0] := i
   for j from 0 to lenStr2
       d[0, j] := j
 
   for i from 1 to lenStr1
       for j from 1 to lenStr2
           if str1[i] = str2[j] then cost := 0
                                else cost := 1
           d[i, j] := minimum(
                                d[i-1, j  ] + 1,     // deletion
                                d[i  , j-1] + 1,     // insertion
                                d[i-1, j-1] + cost   // substitution
                            )
           if(i > 1 and j > 1 and str1[i] = str2[j-1] and str1[i-1] = str2[j]) then
               d[i, j] := minimum(
                                d[i, j],
                                d[i-2, j-2] + cost   // transposition
                             )
                                
 
   return d[lenStr1, lenStr2]

Basically this is the algorithm to compute Levenshtein distance with one additional recurrence:

           if(i > 1 and j > 1 and str1[i] = str2[j-1] and str1[i-1] = str2[j]) then
               d[i, j] := minimum(
                                d[i, j],
                                d[i-2, j-2] + cost   // transposition
                             )

Let us calculate pair-wise distances between the strings TO, OT and OST using this algorithm. The distance between TO and OT is 1. The same for OT vs. OST. But the distance between TO and OST is 3, even though the strings can be made equal using one deletion and one transposition. Clearly, the algorithm does not compute precisely the value we want. Obviously, triangle inequality is also broken.

In reality this algorithm calculates the cost of the so-called optimal string alignment, which does not always equal the edit distance. It is also easy to see that the cost of the optimal string alignment is the number of edit operations needed to make the strings equal under the condition that no substring is edited more than once. We will also call this value a restricted edit distance. As noted by G. Navarro [3], in the general case, i.e. when a set of elementary edition operations includes substitutions of arbitrary length strings, unrestricted edit distance is hardly computable. However, the goal is achievable in the simpler case of Damerau-Levenshtein distance. It is also possible (see [4] for details) to compute unrestricted distance treating reversals of arbitrary, not obligatory adjacent characters as a single edit operation.

To devise a proper algorithm to calculate unrestricted Damerau-Levenshtein distance note that there always exists an optimal sequence of edit operations, where once-transposed letters are never modified afterwards. Thus, we need to consider only two symmetric ways of modifying a substring more than once: (1) transpose letters and insert an arbitrary number of characters between them, or (2) delete a sequence of characters and transpose letters that become adjacent after deletion. The straightforward implementation of this idea gives an algorithm of cubic complexity: , where M and N are string lengths. Using ideas of Lowrance and Wagner [5], this naive algorithm can be improved to be in the worst case.

It is interesting that the bitap algorithm can be modified to process transposition. See, information retrieval section of the site [6] for an example of such an adaptation.

Algorithm discussion

The above-described pseudo-code calculates only restricted edit distance. Damerau-Levenshtein distance plays an important role in natural language processing. It is worth noting that in natural languages, strings are short and the number of errors (misspellings) rarely exceeds 2. In such circumstances, restricted and real edit distance differ very rarely. That is why this limitation is not very important. However, one must remember that restricted edit distance does not always satisfy triangle inequality and, thus, cannot be used with metric trees. An extension of the edit distance algorithm, that does satisfy a triangle inequality is described in the paper: F.J. Damerau. A technique for computer detection and correction of spelling errors, Communications of the ACM, 1964


See also

References