Fuzzy string searching
Fuzzy string searching is the name for a category of techniques for finding strings that approximately match some given pattern string. Fuzzy string searching has two different flavors: finding one or more matching substrings of a text, and finding similar strings in a string set often referred to as dictionary. Fuzzy string searching has many application areas including information retrieval, spellchecking and computational biology [1].
The corner stone of any approximate searching method is a similarity function. Among most commonly used similarity functions are Levenshtein distance and n-gram distance. The latter is based on counting of the number of common n-grams. It is used mostly for filtering. In contrast to n-gram distance, Levenshtein distance is a de-facto standard similarity function. It has several extensions. One well known extension is Damerau-Levenshtein distance that counts transposition as a single edit operation. Another extension is the so-called generalized or weighted Levenshtein distance. It assigns different costs to elementary edit operations. Ukkonen [2] described even more sophisticated similarity function where edit operations go beyond single-character insertions, deletions and substitutions and include substitutions of arbitrary-length strings.
Traditionally, approximate string matching algorithms are classified into two categories: on-line and off-line. With on-line algorithms the pattern can be preprocessed before searching but the text cannot. In other words, on-line techniques do searching without indexation. Early algorithms for on-line approximate matching were suggested by Wagner and Fisher [3] and by Sellers [4]. Both algorithms are based on dynamic programming but solve different problems. Sellers' algorithm searches approximately for a substring in a text while the algorithm of Wagner and Fisher calculates Levenshtein distance, being appropriate for dictionary fuzzy search only.
On-line searching techniques were repeatedly improved. Perhaps, the most famous improvement is bitap algorithm (also known as shift-or and shift-and algorithm), which is very efficient for relatively short pattern strings. Bitap algorithm is the heart of Unix searching utility agrep. An excellent review of on-line searching algorithms was done by G. Navarro [5].
Although very fast on-line techniques exist their performance on large data is unacceptable. In its turn, text preprocessing, or in other words indexing, makes searching dramatically faster. Today, a variety of indexing algorithms are presented. Among them are suffix trees [6], metric trees [7] and n-gram methods [8][9]. For a detailed list of indexing techniques I would address the reader to the paper of Navarro et. al.[10]
See also
- Soundex
- Spellchecker
- String searching algorithm
- Wildcard character
- Levenshtein distance
- Computer-assisted translation
References
- ^ R. Baeza-Yates and G. Navarro. Fast Approximate String Matching in a Dictionary.Proc. SPIRE'98. IEEE CS Press, pages 14-22.
- ^ D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, New York, NY, USA, 1997.
- G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys (CSUR) archive 33(1), pp 31-88, 2001.
- ^ G. Navarro, Ricardo Baeza-Yates, E. Sutinen and J. Tarhio. Indexing Methods for Approximate String Matching.IEEE Data Engineering Bulletin 24(4):19-27, 2001.
- ^ P.H. Sellers. The Theory and Computation of Evolutionary Distances: Pattern Recognition. Journal of Algorithms, 1:359-373, 1980.
- ^ R. Wagner and M. Fisher, The string-to-string correction problem, Journal of the association for computing machinery, vol. 21, pp. 168 173, 1974.
- ^ J. Zobel, P. Dart. Finding approximate matches in large lexicons. Software-Practice & Experience 25(3), pp 331-345, 1995.