{{Short description|Finding strings that approximately match a pattern}} |
[[File:Did you mean andré emotions.png|thumb|300px|A fuzzy MediaWiki search for "angry emoticon" has as a suggested result "andré emotions"]]
In [[computer science]], '''approximate string matching''' (often colloquially referred to as '''fuzzy string searching''') is the technique of finding [[String (computing)|strings]] that match a [[pattern]] approximately (rather than exactly). The problem of approximate string matching is typically divided into two sub-problems: finding approximate [[substring]] matches inside a given string and finding dictionary strings that match the pattern approximately. |
== Overview ==
The closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. This number is called the [[edit distance]] between the string and the pattern. The usual primitive operations are:{{sfn|Cormen|Leiserson|2001}}

* insertion: ''cot'' → ''co'''a'''t''
* deletion: ''co'''a'''t'' → ''cot''
* substitution: ''co'''a'''t'' → ''co'''s'''t''

These three operations may be generalized as forms of substitution by adding a NULL character (here symbolized by *) wherever a character has been deleted or inserted:

* insertion: ''co'''*'''t'' → ''co'''a'''t''
* deletion: ''co'''a'''t'' → ''co'''*'''t''
* substitution: ''co'''a'''t'' → ''co'''s'''t''

Some approximate matchers also treat transposition, in which the positions of two letters in the string are swapped, as a primitive operation.{{sfn|Cormen|Leiserson|2001}}
* transposition: ''co'''st''''' → ''co'''ts''''' |
Different approximate matchers impose different constraints. Some matchers use a single global unweighted cost, that is, the total number of primitive operations necessary to convert the match to the pattern. For example, if the pattern is ''coil'', ''foil'' differs by one substitution, ''coils'' by one insertion, ''oil'' by one deletion, and ''foal'' by two substitutions. If all operations count as a single unit of cost and the limit is set to one, ''foil'', ''coils'', and ''oil'' will count as matches while ''foal'' will not.

Other matchers specify the number of operations of each type separately, while still others set a total cost but allow different weights to be assigned to different operations. Some matchers permit separate assignments of limits and weights to individual groups in the pattern.
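
The following Python sketch illustrates unit-cost matching on the ''coil'' example above: it computes the [[Levenshtein distance]] by dynamic programming and accepts a candidate when the distance is within a global limit of one. The function name and threshold are chosen for exposition only; a weighted matcher would simply use different costs in the inner <code>min</code>.
<syntaxhighlight lang="python">
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance between strings a and b."""
    m, n = len(a), len(b)
    # dist[i][j] = edits needed to turn a[:i] into b[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                               # delete all of a[:i]
    for j in range(n + 1):
        dist[0][j] = j                               # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[m][n]

pattern, limit = "coil", 1
for candidate in ["foil", "coils", "oil", "foal"]:
    verdict = "matches" if edit_distance(pattern, candidate) <= limit else "does not match"
    print(candidate, verdict)
# foil, coils and oil are within one edit of coil; foal (two substitutions) is not
</syntaxhighlight>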
== Problem formulation and algorithms == |
One possible definition of the approximate string matching problem is the following: Given a pattern string <math>P = p_1p_2\dots p_m</math> and a text string <math>T = t_1t_2\dots t_n</math>, find a substring <math>T_{j',j} = t_{j'}\dots t_j</math> in ''T'', which, of all substrings of ''T'', has the smallest edit distance to the pattern ''P''.

A brute-force approach would be to compute the edit distance to ''P'' for all substrings of ''T'', and then choose the substring with the minimum distance. However, this algorithm would have the running time [[Big O notation|''O'']](''n''<sup>3</sup> ''m''), since there are ''O''(''n''<sup>2</sup>) substrings and each edit-distance computation takes ''O''(''nm'') time.
A better solution, which was proposed by Sellers,{{sfn|Sellers|1980}} relies on [[dynamic programming]]. It uses an alternative formulation of the problem: for each position ''j'' in the text ''T'' and each position ''i'' in the pattern ''P'', compute the minimum edit distance between the ''i'' first characters of the pattern, <math>P_i</math>, and any substring <math>T_{j',j}</math> of ''T'' that ends at position ''j''.

For each position ''j'' in the text ''T'', and each position ''i'' in the pattern ''P'', go through all substrings of ''T'' ending at position ''j'', and determine which one of them has the minimal edit distance to the ''i'' first characters of the pattern ''P''. Write this minimal distance as ''E''(''i'', ''j''). After computing ''E''(''i'', ''j'') for all ''i'' and ''j'', we can easily find a solution to the original problem: it is the substring for which ''E''(''m'', ''j'') is minimal (''m'' being the length of the pattern ''P'').
Computing ''E''(''m'', ''j'') is very similar to computing the edit distance between two strings. In fact, we can use the [[Levenshtein distance#Computing Levenshtein distance|Levenshtein distance computing algorithm]] for ''E''(''m'', ''j''), the only difference being that we must initialize the first row with zeros, and save the path of computation, that is, whether we used ''E''(''i'' − 1, ''j''), ''E''(''i'', ''j'' − 1) or ''E''(''i'' − 1, ''j'' − 1) in computing ''E''(''i'', ''j'').

In the array containing the ''E''(''x'', ''y'') values, we then choose the minimal value in the last row, let it be ''E''(''x''<sub>2</sub>, ''y''<sub>2</sub>), and follow the path of computation backwards, back to the row number 0. If the field we arrived at was ''E''(0, ''y''<sub>1</sub>), then ''T''[''y''<sub>1</sub> + 1] ... ''T''[''y''<sub>2</sub>] is a substring of ''T'' with the minimal edit distance to the pattern ''P''.

Computing the ''E''(''x'', ''y'') array takes [[Big O notation|''O'']](''mn'') time with the dynamic programming algorithm, while the backwards-working phase takes [[Big O notation|''O'']](''n'' + ''m'') time.
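
The following Python sketch is an illustrative (unoptimized) rendering of the procedure just described: the first row of ''E'' is initialized to zeros, the table is filled by dynamic programming, and the path of computation is traced back from the minimal entry of the last row. The function and variable names are chosen for exposition only.
<syntaxhighlight lang="python">
def approximate_match(pattern: str, text: str):
    """Return (distance, start, end) for a substring text[start:end] with the
    smallest edit distance to pattern (the first such substring on ties)."""
    m, n = len(pattern), len(text)
    # E[i][j]: minimal edit distance between the first i pattern characters
    # and some substring of the text ending at position j.  The first row is
    # zero because the empty pattern matches an empty substring anywhere.
    E = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        E[i][0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            E[i][j] = min(E[i - 1][j] + 1,         # deletion
                          E[i][j - 1] + 1,         # insertion
                          E[i - 1][j - 1] + cost)  # substitution or match
    # The best end position is where the last row is minimal.
    j_end = min(range(n + 1), key=lambda j: E[m][j])
    # Follow the path of computation backwards to row 0 to find the start.
    i, j = m, j_end
    while i > 0:
        if j > 0 and E[i][j] == E[i - 1][j - 1] + (0 if pattern[i - 1] == text[j - 1] else 1):
            i, j = i - 1, j - 1                    # came from E(i-1, j-1)
        elif E[i][j] == E[i - 1][j] + 1:
            i -= 1                                 # came from E(i-1, j)
        else:
            j -= 1                                 # came from E(i, j-1)
    return E[m][j_end], j, j_end

print(approximate_match("coil", "recoyling"))      # (1, 2, 6): "coyl" is one edit from "coil"
</syntaxhighlight>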
Another recent idea is the similarity join. When matching against large-scale databases, the [[Big O notation|''O'']](''mn'') dynamic programming algorithm cannot complete within a reasonable time, so the idea is to reduce the number of candidate pairs instead of computing the similarity of ''all'' pairs of strings. Widely used algorithms are based on filter-verification, hashing, [[locality-sensitive hashing]] (LSH), [[trie]]s, and other greedy and approximation algorithms. Most of them are designed to fit some framework (such as [[MapReduce]]) to compute concurrently.
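
The filter-verification idea can be sketched as follows; this is an illustration only, not any particular published algorithm, and the parameter values are expository. Strings are indexed by their ''q''-grams, candidate pairs are generated from the inverted index, and only those candidates are verified with an exact edit-distance computation.
<syntaxhighlight lang="python">
from collections import defaultdict
from itertools import combinations

def edit_distance(a, b):
    """Unit-cost Levenshtein distance (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def qgrams(s, q=2):
    """Set of overlapping q-grams of s (assumes len(s) >= q)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def similarity_join(strings, k=2, q=2):
    """Report pairs of strings within edit distance k.  The filter assumes
    every true match shares at least one q-gram, which holds whenever
    max(len(s), len(t)) - q + 1 > k * q for the pair in question."""
    index = defaultdict(set)                 # q-gram -> ids of strings containing it
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].add(sid)
    candidates = set()                       # filter: pairs sharing at least one q-gram
    for ids in index.values():
        candidates.update(combinations(sorted(ids), 2))
    return [(strings[a], strings[b])         # verification: exact edit distance
            for a, b in sorted(candidates)
            if edit_distance(strings[a], strings[b]) <= k]

words = ["approximate", "approximately", "appropriate", "matching", "patching"]
print(similarity_join(words))
# [('approximate', 'approximately'), ('matching', 'patching')]
</syntaxhighlight>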
== Online versus offline ==

Traditionally, approximate string matching algorithms are classified into two categories: [[online algorithm|online]] and offline. With online algorithms the pattern can be processed before searching but the text cannot. In other words, online techniques do searching without an index. Early algorithms for online approximate matching were suggested by Wagner and Fischer{{sfn|Wagner|Fischer|1974}} and by Sellers.{{sfn|Sellers|1980}} Both algorithms are based on [[dynamic programming]] but solve different problems. Sellers' algorithm searches approximately for a substring in a text, while the algorithm of Wagner and Fischer calculates [[Levenshtein distance]], which makes it appropriate for dictionary fuzzy search only.
Online searching techniques have been repeatedly improved. Perhaps the most famous improvement is the [[bitap algorithm]] (also known as the shift-or and shift-and algorithm), which is very efficient for relatively short pattern strings. The bitap algorithm is the heart of the [[Unix]] searching [[programming tool|utility]] [[agrep]]. A review of online searching algorithms was done by G. Navarro.{{sfn|Navarro|2001}}
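
The bit-parallel core of bitap can be sketched as follows; this is the exact-matching shift-and variant, shown for illustration only. The approximate variant used by agrep keeps one such state per allowed error, which is omitted here for brevity.
<syntaxhighlight lang="python">
def shift_and(text: str, pattern: str):
    """Bit-parallel exact matching (the shift-and form of bitap).
    Returns the start indices of all exact occurrences of pattern in text."""
    m = len(pattern)
    # Works best when the pattern fits into one machine word, as in agrep.
    mask = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << i)    # bit i is set iff pattern[i] == c
    state, hits = 0, []
    for j, c in enumerate(text):
        # Bit i of state: pattern[0..i] matches the text ending at position j.
        state = ((state << 1) | 1) & mask.get(c, 0)
        if state & (1 << (m - 1)):
            hits.append(j - m + 1)
    return hits

print(shift_and("abracadabra", "abra"))        # [0, 7]
</syntaxhighlight>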
Although very fast online techniques exist, their performance on large data sets is poor. Text preprocessing or [[index (search engine)|indexing]] makes searching dramatically faster. Today, a variety of indexing algorithms have been presented. Among them are [[suffix tree]]s,{{sfn|Gusfield|1997}} [[metric tree]]s{{sfn|Baeza-Yates|Navarro|1998}} and [[n-gram]] methods.{{sfn|Navarro|Baeza-Yates|Sutinen|Tarhio|2001}}{{sfn|Zobel|Dart|1995}}
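
As a simple illustration of the metric-tree family, the sketch below indexes a small dictionary in a Burkhard–Keller tree (BK-tree) and answers fuzzy queries by pruning subtrees with the triangle inequality. It is not the specific structure described in the cited surveys; the names and the sample dictionary are chosen for exposition.
<syntaxhighlight lang="python">
def edit_distance(a, b):
    """Unit-cost Levenshtein distance (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    """Burkhard-Keller tree over edit distance: the dictionary is indexed
    once (offline), after which fuzzy queries touch only part of the tree."""
    def __init__(self, words):
        words = iter(words)
        self.root = [next(words), {}]          # [word, children keyed by distance]
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d not in node[1]:
                node[1][d] = [word, {}]
                return
            node = node[1][d]

    def search(self, query, k):
        """Return all indexed words within edit distance k of query."""
        found, stack = [], [self.root]
        while stack:
            word, children = stack.pop()
            d = edit_distance(query, word)
            if d <= k:
                found.append(word)
            # Triangle inequality: only subtrees keyed in [d - k, d + k] can
            # contain a word within distance k of the query.
            stack.extend(child for dist, child in children.items()
                         if d - k <= dist <= d + k)
        return found

tree = BKTree(["coil", "foil", "coils", "oil", "foal", "fail"])
print(sorted(tree.search("coil", 1)))          # ['coil', 'coils', 'foil', 'oil']
</syntaxhighlight>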
A detailed survey of indexing techniques that allow one to find an arbitrary substring in a text is given by Navarro ''et al.''{{sfn|Navarro|Baeza-Yates|Sutinen|Tarhio|2001}} A computational survey of dictionary methods (i.e., methods that permit finding all dictionary words that approximately match a search pattern) is given by Boytsov.{{sfn|Boytsov|2011}}
== Applications == |
Common applications of approximate matching include [[spell checking]].{{sfn|Gusfield|1997}} With the availability of large amounts of DNA data, matching of [[nucleotide]] sequences has become an important application.{{sfn|Cormen|Leiserson|2001}} Approximate matching is also used in [[spam filtering]].{{sfn|Gusfield|1997}} [[Record linkage]] is a common application where records from two disparate databases are matched.

String matching cannot be used for most binary data, such as images and music; these require different algorithms, such as [[acoustic fingerprint]]ing.

The command-line tool ''fzf'' is often used to integrate approximate string searching into various command-line applications.<ref>{{Cite web |date=2018-11-08 |title=Fzf - A Quick Fuzzy File Search from Linux Terminal |url=https://www.tecmint.com/fzf-fuzzy-file-search-from-linux-terminal/ |access-date=2022-09-08 |website=www.tecmint.com |language=en-US}}</ref>
== See also == |
{{div col|colwidth=20em}}
* [[Concept search]]
* [[Jaro–Winkler distance]]
* [[Levenshtein distance]]
* [[Locality-sensitive hashing]]
* [[Metaphone]]
* [[Needleman–Wunsch algorithm]]
* [[Plagiarism detection]]
* [[Regular expressions]] for fuzzy and non-fuzzy matching
* [[Smith–Waterman algorithm]]
* [[Soundex]]
* [[String metric]]
* [[Vector database]] for semantic similarity search
{{div col end}}
==References==

===Citations===
{{reflist}}

===Works cited===
{{refbegin}}
* {{cite conference | last1=Baeza-Yates | first1=R. | last2=Navarro | first2=G. | title=Fast Approximate String Matching in a Dictionary | book-title=Proc. SPIRE'98 | publisher=IEEE CS Press | pages=14–22 | year=1998 }}
* {{cite journal | last1=Boytsov | first1=Leonid | title=Indexing methods for approximate dictionary searching: Comparative analysis | journal=Journal of Experimental Algorithmics | volume=16 | issue=1 | pages=1–91 | year=2011 | doi=10.1145/1963190.1963191 | s2cid=15635688 }}
* {{cite book | last1=Cormen | first1=Thomas | last2=Leiserson | first2=Rivest | title=Introduction to Algorithms | edition=2nd | publisher=MIT Press | pages=364–7 | year=2001 | isbn=978-0-262-03293-3 }}
* {{cite book | last1=Gusfield | first1=Dan | title=Algorithms on strings, trees, and sequences: computer science and computational biology | publisher=Cambridge University Press | location=Cambridge, UK | year=1997 | isbn=978-0-521-58519-4 }}
* {{cite journal | last1=Navarro | first1=Gonzalo | title=A guided tour to approximate string matching | journal=ACM Computing Surveys | volume=33 | issue=1 | pages=31–88 | year=2001 | citeseerx=10.1.1.96.7225 | doi=10.1145/375360.375365 | s2cid=207551224 }}
* {{cite journal | last1=Navarro | first1=Gonzalo | first2=Ricardo | last2=Baeza-Yates | first3=Erkki | last3=Sutinen | first4=Jorma | last4=Tarhio | title=Indexing Methods for Approximate String Matching | journal=IEEE Data Engineering Bulletin | volume=24 | issue=4 | pages=19–27 | year=2001 | url=http://www.dcc.uchile.cl/~gnavarro/ps/deb01.pdf }}
* {{cite journal | last1=Sellers | first1=Peter H. | title=The Theory and Computation of Evolutionary Distances: Pattern Recognition | journal=Journal of Algorithms | volume=1 | issue=4 | pages=359–73 | year=1980 | doi=10.1016/0196-6774(80)90016-4 }}
* {{cite book | last1=Skiena | first1=Steve | title=Algorithm Design Manual | edition=1st | publisher=Springer | year=1998 | isbn=978-0-387-94860-7 }}
* {{cite journal | last1=Wagner | first1=R. | last2=Fischer | first2=M. | title=The string-to-string correction problem | journal=Journal of the ACM | volume=21 | pages=168–73 | year=1974 | doi=10.1145/321796.321811 | s2cid=13381535 }}
* {{cite journal | last1=Zobel | first1=Justin | last2=Dart | first2=Philip | title=Finding approximate matches in large lexicons | journal=Software: Practice and Experience | volume=25 | issue=3 | pages=331–345 | year=1995 | doi=10.1002/spe.4380250307 | citeseerx=10.1.1.14.3856 | s2cid=6776819 }}
{{refend}}
==Further reading== |
{{refbegin}}
* {{cite conference | last1=Baeza-Yates | first1=R. | last2=Navarro | first2=G. | title=A faster algorithm for approximate string matching | editor1=Dan Hirschberg | editor2=Gene Myers | book-title=Combinatorial Pattern Matching (CPM'96), LNCS 1075 | pages=1–23 | date=June 1996 | location=Irvine, CA | citeseerx=10.1.1.42.1593 | ref=none}}
* {{cite book | last1=Galil | first1=Zvi | last2=Apostolico | first2=Alberto | title=Pattern matching algorithms | publisher=Oxford University Press | location=Oxford | year=1997 | isbn=978-0-19-511367-9 | ref=none}}
* {{cite journal | last1=Myers | first1=G. | title=A fast bit-vector algorithm for approximate string matching based on dynamic programming | journal=Journal of the ACM | volume=46 | issue=3 | pages=395–415 | date=May 1999 | doi=10.1145/316542.316550 | s2cid=1158099 | url=http://www.gersteinlab.org/courses/452/09-spring/pdf/Myers.pdf | ref=none}}
* {{cite journal | last1=Ukkonen | first1=E. | title=Algorithms for approximate string matching | journal=Information and Control | volume=64 | issue=1–3 | pages=100–18 | year=1985 | doi=10.1016/S0019-9958(85)80046-2 | doi-access=free | ref=none}}
{{refend}} |
== External links == |
* [http://flamingo.ics.uci.edu Flamingo Project]
* [https://web.archive.org/web/20160319053237/http://www.cse.unsw.edu.au/~weiw/project/simjoin.html Efficient Similarity Query Processing Project] with recent advances in approximate string matching based on an edit distance threshold.
* [http://rockymadden.com/stringmetric/ StringMetric project] a [[Scala programming language|Scala]] library of string metrics and phonetic algorithms
* [https://github.com/NaturalNode/natural Natural project] a [[JavaScript]] natural language processing library which includes implementations of popular string metrics
{{strings}} |
{{DEFAULTSORT:Approximate String Matching}} |
[[Category:String matching algorithms|*]] |
[[Category:Algorithms on strings]] |
[[Category:Pattern matching]] |
[[Category:Dynamic programming]] |