Jump to content

Levenshtein automaton

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by David Eppstein (talk | contribs) at 19:08, 16 September 2015 (Applications: new material needs a source). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In computer science, a Levenshtein automaton for a string w and a number n is a finite state automaton that can recognize the set of all strings whose Levenshtein distance from w is at most n. That is, a string x is in the formal language recognized by the Levenshtein automaton if and only if x can be transformed into w by at most n single-character insertions, deletions, and substitutions.[1]

Applications

Levenshtein automata may be used for spelling correction, by finding words in a given dictionary that are close to a misspelled word. In this application, once a word is identified as being misspelled, its Levenshtein automaton may be constructed, and then applied to all of the words in the dictionary to determine which ones are close to the misspelled word. If the dictionary is stored in compressed form as a trie, the time for this algorithm (after the automaton has been constructed) is proportional to the number of nodes in the trie, significantly faster than using dynamic programming to compute the Levenshtein distance separately for each dictionary word.[1]

It is also possible to find words in a regular language, rather than a finite dictionary, that are close to a given target word, by computing the Levenshtein automaton for the word, and then using a Cartesian product construction to combine it with an automaton for the regular language, giving an automaton for the intersection language. Alternatively, rather than using the product construction, both the Levenshtein automaton and the automaton for the given regular language may be traversed simultaneously using a backtracking algorithm.[1] However, the backtracking algorithm may give exponential runtime. Certain Canadian Universities have conducted further study through the application of this automaton for advance backspace spell corrections in mobile environments. The construction of serialized automaton however can take a while to initialize due to the limited CPU performance in a mobile device typically taking 5-6 seconds on a modern ARM architecture processor which is noticeable to the end user.[citation needed]

Construction

For any fixed constant n, the Levenshtein automaton for w and n may be constructed in time O(|w|).[1]

Mitankin studies a variant of this construction called the universal Levenshtein automaton, determined only by a numeric parameter n, that can recognize pairs of words (encoded in a certain way by bitvectors) that are within Levenshtein distance n of each other.[2]

See also

  • agrep, tool (implemented several times) for approximate regular expression matching
  • Bitap algorithm
  • TRE, library for regular expression matching that is tolerant to Levenshtein-style edits

References

  1. ^ a b c d Schulz, Klaus U.; Mihov, Stoyan (2002). "Fast String Correction with Levenshtein-Automata". International Journal of Document Analysis and Recognition. 5 (1): 67–85. doi:10.1007/s10032-002-0082-8. CiteSeerx10.1.1.16.652.
  2. ^ Mitankin, Petar N. (2005). Universal Levenshtein Automata. Building and Properties (PDF) (Thesis). Sofia University St. Kliment Ohridski.