Levenshtein automaton
{{Short description|Mathematical model for string comparison}}
In [[computer science]], a '''Levenshtein automaton''' for a [[string (computer science)|string]] ''w'' and a number ''n'' is a [[finite-state automaton]] that can recognize the set of all strings whose [[Levenshtein distance]] from ''w'' is at most ''n''. That is, a string ''x'' is in the [[formal language]] recognized by the Levenshtein automaton if and only if ''x'' can be transformed into ''w'' by at most ''n'' single-character insertions, deletions, and substitutions.<ref name="ssm">{{cite journal | first1 = Klaus U. | last1 = Schulz | first2 = Stoyan | last2 = Mihov | year = 2002 | citeseerx = 10.1.1.16.652 | title = Fast String Correction with Levenshtein-Automata | journal = International Journal of Document Analysis and Recognition | volume = 5 | issue = 1 | pages = 67–85 | doi = 10.1007/s10032-002-0082-8 | s2cid = 207046453 | url = https://link.springer.com/article/10.1007/s10032-002-0082-8 | url-access = subscription}}</ref>
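For illustration, the automaton's behaviour can be simulated directly. The following Python sketch (an illustrative simulation, not any particular published construction) runs the nondeterministic Levenshtein automaton for ''w'' and ''n'' on an input string ''x''; a state (''i'', ''e'') records that the first ''i'' characters of ''w'' have been matched using ''e'' edits.

```python
def levenshtein_accepts(w: str, n: int, x: str) -> bool:
    """Accept x iff the Levenshtein distance between w and x is at most n."""

    def close(states):
        # Deletion edits consume a character of w without reading input,
        # so they behave like epsilon moves: take the closure under them.
        stack = list(states)
        while stack:
            i, e = stack.pop()
            if i < len(w) and e < n and (i + 1, e + 1) not in states:
                states.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return states

    states = close({(0, 0)})
    for c in x:
        step = set()
        for i, e in states:
            if i < len(w) and w[i] == c:
                step.add((i + 1, e))          # match: no edit needed
            if e < n:
                step.add((i, e + 1))          # insertion of c
                if i < len(w):
                    step.add((i + 1, e + 1))  # substitution of w[i] by c
        states = close(step)
    # Accept if all of w has been matched within the edit budget.
    return any(i == len(w) for i, e in states)
```

A practical construction determinizes or tabulates this state space rather than simulating it per input, but the accepted language is the same.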
==Applications==

Levenshtein automata may be used for spelling correction, by finding words in a given dictionary that are close to a misspelled word. In this application, once a word is identified as being misspelled, its Levenshtein automaton may be constructed, and then applied to all of the words in the dictionary to determine which ones are close to the misspelled word. If the dictionary is stored in compressed form as a [[trie]], the time for this algorithm (after the automaton has been constructed) is proportional to the number of nodes in the trie, significantly faster than using [[dynamic programming]] to compute the Levenshtein distance separately for each dictionary word.<ref name="ssm"/>
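The trie traversal can be sketched in Python as follows; the dictionary encoding (nested dicts with a "$" end-of-word marker, assumed unused in words) and the state-set representation are illustrative choices, not those of the cited paper. Branches of the trie on which the automaton's state set becomes empty are pruned.

```python
def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def fuzzy_search(trie, w, n):
    """Return the dictionary words within Levenshtein distance n of w."""

    def close(states):
        # Closure under deletions of characters of w (epsilon-like moves).
        stack = list(states)
        while stack:
            i, e = stack.pop()
            if i < len(w) and e < n and (i + 1, e + 1) not in states:
                states.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return states

    def step(states, c):
        out = set()
        for i, e in states:
            if i < len(w) and w[i] == c:
                out.add((i + 1, e))           # match
            if e < n:
                out.add((i, e + 1))           # insertion
                if i < len(w):
                    out.add((i + 1, e + 1))   # substitution
        return close(out)

    results = []

    def walk(node, states, prefix):
        if "$" in node and any(i == len(w) for i, e in states):
            results.append(prefix)
        for c, child in node.items():
            if c == "$":
                continue
            nxt = step(states, c)
            if nxt:  # prune branches where the automaton is dead
                walk(child, nxt, prefix + c)

    walk(trie, close({(0, 0)}), "")
    return results
```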
It is also possible to find words in a [[regular language]], rather than a finite dictionary, that are close to a given target word, by computing the Levenshtein automaton for the word, and then using a [[Cartesian product]] construction to combine it with an automaton for the regular language, giving an automaton for the intersection language. Alternatively, rather than using the product construction, both the Levenshtein automaton and the automaton for the given regular language may be traversed simultaneously using a [[backtracking]] algorithm.<ref name="ssm"/>

Levenshtein automata are used in [[Apache Lucene|Lucene]] for [[Full-text search|full-text searches]] that can return relevant documents even if the query is misspelled.<ref>{{Cite web|last=McCandless|first=Michael|date=24 March 2011|title=Lucene's FuzzyQuery is 100 times faster in 4.0|url=http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html|access-date=2021-06-07|website=Changing Bits|language=en}}</ref>
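The backtracking alternative described above can be sketched as follows (a hypothetical illustration, not the cited paper's algorithm): the regular language is given as a [[deterministic finite automaton]] whose transitions are backtracked while the set of live Levenshtein-automaton states is tracked alongside. Since any word within distance ''n'' of ''w'' has length at most |''w''| + ''n'', the search is depth-bounded even when the DFA has cycles.

```python
def search_language(delta, accepting, start, w, n):
    """Words of the DFA's language within Levenshtein distance n of w.

    delta: dict mapping state -> {character: next state}
    accepting: set of accepting DFA states; start: initial DFA state.
    """

    def close(states):
        # Closure under deletions of characters of w (epsilon-like moves).
        stack = list(states)
        while stack:
            i, e = stack.pop()
            if i < len(w) and e < n and (i + 1, e + 1) not in states:
                states.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return states

    def step(states, c):
        out = set()
        for i, e in states:
            if i < len(w) and w[i] == c:
                out.add((i + 1, e))           # match
            if e < n:
                out.add((i, e + 1))           # insertion
                if i < len(w):
                    out.add((i + 1, e + 1))   # substitution
        return close(out)

    results = []

    def walk(q, states, prefix):
        if q in accepting and any(i == len(w) for i, e in states):
            results.append(prefix)
        if len(prefix) == len(w) + n:  # depth bound
            return
        for c, q2 in delta.get(q, {}).items():
            nxt = step(states, c)
            if nxt:  # prune dead branches of the Levenshtein automaton
                walk(q2, nxt, prefix + c)

    walk(start, close({(0, 0)}), "")
    return results
```

For example, with the two-state DFA for (ab)*, `search_language` with ''w'' = "abab" and ''n'' = 1 returns only "abab".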
==Construction==

For any fixed constant ''n'', the Levenshtein automaton for ''w'' and ''n'' may be constructed in time O(|''w''|).<ref name="ssm"/>
Mitankin studies a variant of this construction called the '''universal Levenshtein automaton''', determined only by a numeric parameter ''n'', that can recognize pairs of words (encoded in a certain way by bitvectors) that are within Levenshtein distance ''n'' of each other.<ref>{{cite thesis | first = Petar N. | last = Mitankin | year = 2005 | url = http://www.fmi.uni-sofia.bg/fmi/logic/theses/mitankin-en.pdf | title = Universal Levenshtein Automata. Building and Properties | publisher = Sofia University St. Kliment Ohridski }}</ref> Touzet proposed an efficient algorithm to build this automaton.<ref>{{cite book | author = Touzet H. | chapter = On the Levenshtein Automaton and the Size of the Neighbourhood of a Word | year = 2016 | title = Language and Automata Theory and Applications | series = Lecture Notes in Computer Science | publisher = Springer | volume = 9618 | pages = 207–218 | doi = 10.1007/978-3-319-30000-9_16 | isbn = 978-3-319-29999-0 | s2cid = 34821290 | chapter-url = https://hal.archives-ouvertes.fr/hal-01360482/file/LATA2016.pdf }}</ref>
A third finite-automaton construction for Levenshtein (or [[Damerau–Levenshtein distance|Damerau–Levenshtein]]) distance is given by the Levenshtein transducers of Hassan ''et al.'', who present [[finite state transducer]]s implementing edit distance one, then compose these to implement edit distances up to some constant.<ref>{{cite conference |last1=Hassan |first1=Ahmed |first2=Sara |last2=Noeman |first3=Hany |last3=Hassan |title=Language Independent Text Correction using Finite State Automata |conference=IJCNLP |year=2008 |url=http://www.aclweb.org/anthology/I08-2131}}</ref>
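The composition idea can be illustrated with sets of words rather than actual transducers: a distance-one "transducer" maps a word to its neighbourhood of words at edit distance at most one, and composing it ''k'' times yields the neighbourhood at distance at most ''k''. The sketch below (a loose illustration over a fixed alphabet, using plain Levenshtein edits; the Damerau variant would also add transpositions) is not the cited construction, which operates on automata rather than enumerated sets.

```python
def neighbourhood1(word, alphabet):
    """Words at Levenshtein distance at most 1 from word, over alphabet."""
    out = {word}
    for i in range(len(word) + 1):
        for c in alphabet:
            out.add(word[:i] + c + word[i:])       # insertion
    for i in range(len(word)):
        out.add(word[:i] + word[i + 1:])           # deletion
        for c in alphabet:
            out.add(word[:i] + c + word[i + 1:])   # substitution
    return out

def neighbourhood(word, alphabet, k):
    """Compose the distance-1 map k times: distance at most k."""
    words = {word}
    for _ in range(k):
        words = set().union(*(neighbourhood1(x, alphabet) for x in words))
    return words
```

Enumerated neighbourhoods grow exponentially in ''k'', which is why the transducer (or automaton) representation is preferred in practice.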
==See also==

* [[agrep]], tool (implemented several times) for approximate regular expression matching
* [[Bitap algorithm]]
* [[TRE (computing)|TRE]], library for regular expression matching that is tolerant to Levenshtein-style edits
==References==

{{reflist}}
{{Strings}}
[[Category:Finite automata]]
Latest revision as of 14:16, 27 September 2023