Fuzzy string searching

Fuzzy string searching is the name for a category of techniques for finding strings that approximately match some given pattern string. It comes in two different flavors: finding one or more matching substrings of a text, and finding similar strings in a string set, often referred to as a dictionary. Fuzzy string searching has many application areas, including information retrieval, spellchecking and computational biology [1].

The cornerstone of any approximate searching method is a similarity function. Among the most commonly used similarity functions are Levenshtein distance and n-gram distance. The latter is based on counting the number of common n-grams and is used mostly for filtering. In contrast to n-gram distance, Levenshtein distance is a de facto standard similarity function. It has several extensions. One well-known extension is the Damerau-Levenshtein distance, which counts a transposition as a single edit operation. Another is the so-called generalized, or weighted, Levenshtein distance, which assigns different costs to the elementary edit operations. Ukkonen [2] described an even more sophisticated similarity function in which edit operations go beyond single-character insertions, deletions and substitutions and include substitutions of arbitrary-length strings.
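
The following minimal sketch in Python illustrates the two measures; the function names are illustrative and not taken from the cited papers, and the n-gram count is a simplified, unweighted version of the filtering measures used in practice.

def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance (unit-cost insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))            # distances between the empty prefix of a and every prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

def common_ngrams(a, b, n=2):
    """Number of distinct n-grams shared by a and b (a simplified filtering measure)."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    return len(grams_a & grams_b)

For example, levenshtein("kitten", "sitting") returns 3; a weighted variant would simply replace the three unit costs with operation-specific weights.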

Traditionally, approximate string matching algorithms are classified into two categories: on-line and off-line. With on-line algorithms the pattern can be preprocessed before searching, but the text cannot; in other words, on-line techniques search without an index. Early algorithms for on-line approximate matching were suggested by Wagner and Fischer [3] and by Sellers [4]. Both algorithms are based on dynamic programming but solve different problems: Sellers' algorithm searches approximately for a substring in a text, while the algorithm of Wagner and Fischer calculates the Levenshtein distance between two strings, which makes it appropriate for fuzzy dictionary search only.
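
The difference lies mainly in the initialization of the dynamic-programming matrix. The Python sketch below follows a common textbook formulation of Sellers' variant (it is not code from the original paper): the first row of the matrix is kept at zero so that an occurrence may start at any text position, and every text position whose last matrix cell is at most k is reported.

def sellers_search(pattern, text, k):
    """Return the (1-based) end positions in text of substrings that match
    pattern with at most k edit errors (Sellers-style approximate search)."""
    m = len(pattern)
    col = list(range(m + 1))        # column 0: pattern prefixes matched against the empty text
    matches = []
    for j, tc in enumerate(text, start=1):
        prev_diag = col[0]          # value of the cell diagonally above-left
        col[0] = 0                  # an occurrence may begin at any text position
        for i, pc in enumerate(pattern, start=1):
            cost = 0 if pc == tc else 1
            prev_diag, col[i] = col[i], min(col[i] + 1,         # skip the text character
                                            col[i - 1] + 1,     # skip the pattern character
                                            prev_diag + cost)   # match or substitution
        if col[m] <= k:
            matches.append(j)       # approximate occurrence ends at text position j
    return matches

Initializing the first row to 0, 1, ..., m instead of all zeros would turn the same recurrence into the Wagner-Fischer computation of the distance between the two whole strings.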

On-line searching techniques have been repeatedly improved. Perhaps the most famous improvement is the bitap algorithm (also known as the shift-or or shift-and algorithm), which is very efficient for relatively short pattern strings. The bitap algorithm is the heart of the Unix searching utility agrep. An excellent review of on-line searching algorithms was written by G. Navarro [5].
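
The core of bitap is bit parallelism: the states of a simple matching automaton are packed into machine words and updated with a few bitwise operations per text character. The sketch below is a simplified shift-and formulation with the extension to k errors in the style of the Wu-Manber work behind agrep; it is illustrative only, and agrep itself contains many further optimizations.

def bitap_fuzzy(pattern, text, k):
    """Bit-parallel (shift-and) search for pattern in text with at most k edit errors.

    One bitmask per error level; matches are reported by the index of their last
    character. In a real implementation the pattern must fit into one machine word
    (Python integers have no such limit)."""
    m = len(pattern)
    B = {}                                  # B[c]: bit i set iff pattern[i] == c
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)
    # R[d]: bit i set iff pattern[:i+1] matches a suffix of the scanned text with <= d errors.
    R = [(1 << d) - 1 for d in range(k + 1)]
    matches = []
    for j, c in enumerate(text):
        mask = B.get(c, 0)
        old = R[0]                          # level d-1 mask before this character
        R[0] = ((R[0] << 1) | 1) & mask
        for d in range(1, k + 1):
            prev = R[d]
            R[d] = (((R[d] << 1) | 1) & mask    # match at the same error level
                    | old | (old << 1)          # insertion / substitution from level d-1, previous position
                    | (R[d - 1] << 1)           # deletion of a pattern character from level d-1, current position
                    | ((1 << d) - 1))           # short prefixes are always reachable with <= d errors
            old = prev
        if R[k] & accept:
            matches.append(j)               # approximate occurrence ending at text index j
    return matches

As long as the pattern fits into a single machine word, each text character costs only O(k) word operations, which is why the method is attractive for short patterns.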

Although very fast on-line techniques exist, their performance on large data is unacceptable. Text preprocessing, or in other words indexing, in turn makes searching dramatically faster. Today a variety of indexing algorithms exist; among them are suffix trees [6], metric trees [7] and n-gram methods [8][9]. For a detailed list of indexing techniques, the reader is referred to the survey by Navarro et al. [10]
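
As an illustration of the n-gram approach to indexing a dictionary, the toy Python index below (an illustrative sketch rather than the algorithm of any cited paper; it reuses the levenshtein function sketched earlier) stores every word under its n-grams, gathers candidate words that share at least one n-gram with the query, and verifies the candidates with the exact distance.

from collections import defaultdict

class NGramIndex:
    """Toy n-gram index for fuzzy dictionary search: candidate generation by
    shared n-grams, verification by exact Levenshtein distance."""

    def __init__(self, words, n=2):
        self.n = n
        self.words = list(words)
        self.postings = defaultdict(set)            # n-gram -> ids of words containing it
        for wid, w in enumerate(self.words):
            for g in self._grams(w):
                self.postings[g].add(wid)

    def _grams(self, s):
        return {s[i:i + self.n] for i in range(len(s) - self.n + 1)}

    def search(self, query, k):
        """Return the dictionary words within Levenshtein distance k of query."""
        # Filtering step: real systems require a minimum number of shared n-grams
        # derived from k; here any shared n-gram makes a word a candidate.
        candidates = set()
        for g in self._grams(query):
            candidates |= self.postings.get(g, set())
        # Verification step, using the levenshtein function defined earlier.
        return [self.words[wid] for wid in candidates
                if levenshtein(query, self.words[wid]) <= k]

For example, NGramIndex(["colour", "color", "collar"]).search("colr", 1) inspects only the candidate words sharing an n-gram with the query and returns ["color"].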

See also

References