Metaphone: Difference between revisions

From Wikipedia, the free encyclopedia

Revision as of 23:04, 12 January 2011

Lawrence Philips redirects here. For the football player, see Lawrence Phillips.

Metaphone is a phonetic algorithm, published in 1990, for indexing words by their English pronunciation. It produces variable-length keys as its output, as opposed to Soundex's fixed-length keys. Similar-sounding words share the same key.

Metaphone was developed by Lawrence Philips as a response to deficiencies in the Soundex algorithm. It uses a larger set of rules for English pronunciation. Metaphone is available as a built-in function in a number of systems, including later versions of PHP.

The original author later produced a new version of the algorithm, which he named Double Metaphone, that produces more accurate results than the original algorithm.

Procedure

Metaphone codes use the 16 consonant symbols 0BFHJKLMNPRSTWXY[1]. The '0' represents "th" (as an ASCII approximation of Θ), 'X' represents "sh" or "ch", and the others represent their usual English pronunciations. The vowels AEIOU are also used, but only at the beginning of the code.[2]
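
The alphabet described above can be checked mechanically. The following is a minimal sketch, assuming a code is well formed when its first symbol is either a vowel or one of the 16 consonant symbols and every later symbol is a consonant symbol; the names `CODE_RE` and `is_valid_code` are illustrative and not part of any published implementation.

```python
import re

# The 16 consonant symbols listed in the text; "0" stands for "th".
CONSONANT_SYMBOLS = "0BFHJKLMNPRSTWXY"

# Vowels may appear only as the first symbol of a code.
CODE_RE = re.compile(
    "[AEIOU" + CONSONANT_SYMBOLS + "][" + CONSONANT_SYMBOLS + "]*"
)


def is_valid_code(code: str) -> bool:
    """Return True if `code` uses only the symbols described above."""
    return CODE_RE.fullmatch(code) is not None


print(is_valid_code("0MS"))   # True: consonant symbols only
print(is_valid_code("AKS"))   # True: vowel in first position
print(is_valid_code("TAK"))   # False: vowel after the first position
print(is_valid_code("TD"))    # False: "D" is not a code symbol
```

Treating the empty string as invalid is a choice made for this sketch; the text does not say how an empty code should be handled.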

  1. Drop duplicate adjacent letters, except for C.
  2. If the word begins with 'KN', 'GN', 'PN', 'AE', or 'WR', drop the first letter.
  3. Drop 'B' if it follows 'M' at the end of the word.
  4. 'C' transforms to 'X' if followed by 'IA' or 'H' (unless, in the latter case, it is part of '-SCH-', in which case it transforms to 'K'). 'C' transforms to 'S' if followed by 'I', 'E', or 'Y'. Otherwise, 'C' transforms to 'K'.
  5. 'D' transforms to 'J' if followed by 'GE', 'GY', or 'GI'. Otherwise, 'D' transforms to 'T'.
  6. Drop 'G' if followed by 'H', unless that 'H' is at the end of the word or before a vowel. Drop 'G' if followed by 'N' or 'NED' at the end of the word.
  7. 'G' transforms to 'J' if before 'I', 'E', or 'Y', and it is not in 'GG'. Otherwise, 'G' transforms to 'K'. Reduce 'GG' to 'G'.
  8. Drop 'H' if after a vowel and not before a vowel.
  9. 'CK' transforms to 'K'.
  10. 'PH' transforms to 'F'.
  11. 'Q' transforms to 'K'.
  12. 'S' transforms to 'X' if followed by 'H', 'IO', or 'IA'.
  13. 'T' transforms to 'X' if followed by 'IA' or 'IO'. 'TH' transforms to '0'. Drop 'T' if followed by 'CH'.
  14. 'V' transforms to 'F'.
  15. 'WH' transforms to 'W' if at the beginning. Drop 'W' if not followed by a vowel.
  16. 'X' transforms to 'S' if at the beginning. Otherwise, 'X' transforms to 'KS'.
  17. Drop 'Y' if not followed by a vowel.
  18. 'Z' transforms to 'S'.
  19. Drop all vowels except at the beginning of the word.
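
The rules above can be approximated in a single left-to-right pass. The following is a rough sketch, not a reference implementation: the function name `metaphone_sketch` is hypothetical, duplicate letters are skipped during the scan rather than removed beforehand, and two behaviours the list leaves open are assumed (an 'H' after C, G, P, S or T is consumed as part of the digraph, and a 'G' already coded as 'J' by the 'D' rule is not coded again). Its output may differ from built-in implementations such as PHP's.

```python
VOWELS = frozenset("AEIOU")
FRONT = frozenset("IEY")              # letters that soften C and G
H_DIGRAPH_LEAD = frozenset("CGPST")   # assumption: H after these is absorbed
SIMPLE = {"Q": "K", "V": "F", "Z": "S"}   # rules 11, 14, 18


def metaphone_sketch(word: str) -> str:
    """Approximate the listed rules; not a reference implementation."""
    w = "".join(ch for ch in word.upper() if "A" <= ch <= "Z")
    if w[:2] in ("KN", "GN", "PN", "AE", "WR"):        # rule 2
        w = w[1:]
    out = []
    for i, ch in enumerate(w):
        prev = w[i - 1] if i > 0 else ""
        rest = w[i + 1:]              # everything after the current letter
        nxt = rest[:1]
        if ch == prev and ch != "C":                   # rule 1
            continue
        if ch in VOWELS:                               # rule 19
            if i == 0:
                out.append(ch)
        elif ch == "B":                                # rule 3
            if not (prev == "M" and not rest):
                out.append("B")
        elif ch == "C":                                # rule 4
            if rest.startswith("IA"):
                out.append("X")
            elif nxt == "H":
                out.append("K" if prev == "S" else "X")
            else:
                out.append("S" if nxt in FRONT else "K")
        elif ch == "D":                                # rule 5
            out.append("J" if nxt == "G" and rest[1:2] in FRONT else "T")
        elif ch == "G":                                # rules 6 and 7
            silent_gh = (nxt == "H" and rest[1:2] != ""
                         and rest[1:2] not in VOWELS)
            coded_by_d = prev == "D" and nxt in FRONT  # assumption: no "JJ"
            if silent_gh or rest in ("N", "NED") or coded_by_d:
                continue
            out.append("J" if nxt in FRONT else "K")
        elif ch == "H":                                # rule 8, plus digraphs
            in_digraph = prev in H_DIGRAPH_LEAD or (prev == "W" and i == 1)
            if in_digraph or (prev in VOWELS and nxt not in VOWELS):
                continue
            out.append("H")
        elif ch == "K":                                # rule 9
            if prev != "C":
                out.append("K")
        elif ch == "P":                                # rule 10
            out.append("F" if nxt == "H" else "P")
        elif ch == "S":                                # rule 12
            out.append("X" if nxt == "H" or rest[:2] in ("IO", "IA") else "S")
        elif ch == "T":                                # rule 13
            if rest[:2] in ("IA", "IO"):
                out.append("X")
            elif nxt == "H":
                out.append("0")
            elif rest[:2] != "CH":
                out.append("T")
        elif ch == "W":                                # rule 15
            if nxt in VOWELS or (i == 0 and nxt == "H"):
                out.append("W")
        elif ch == "X":                                # rule 16
            out.append("S" if i == 0 else "KS")
        elif ch == "Y":                                # rule 17
            if nxt in VOWELS:
                out.append("Y")
        else:                                          # F J L M N R, Q V Z
            out.append(SIMPLE.get(ch, ch))
    return "".join(out)


print(metaphone_sketch("Knight"), metaphone_sketch("night"))   # NT NT
print(metaphone_sketch("which"), metaphone_sketch("witch"))    # WX WX
print(metaphone_sketch("Thumb"))                               # 0M
```

Under these assumptions, "Knight" and "night" share the key NT, and "which" and "witch" share WX, which is the behaviour the algorithm is meant to provide for similar-sounding words.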

Double Metaphone

The Double Metaphone search algorithm is the second generation of this algorithm. Its implementation was described in the June 2000 issue of C/C++ Users Journal.

It is called "Double" because it can return both a primary and a secondary code for a string; this accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry. For example, encoding the name "Smith" yields a primary code of SM0 and a secondary code of XMT, while the name "Schmidt" yields a primary code of XMT and a secondary code of SMT; both have XMT in common.
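
One way to use the two codes when comparing names is to treat a pair as a candidate match whenever their code sets overlap. The sketch below illustrates that idea only; it does not compute Double Metaphone codes. The codes are copied from the example above, and `EXAMPLE_CODES` and `shared_codes` are hypothetical names.

```python
# (primary, secondary) codes taken from the example in the text,
# not computed by this sketch.
EXAMPLE_CODES = {
    "Smith": ("SM0", "XMT"),
    "Schmidt": ("XMT", "SMT"),
}


def shared_codes(codes_a, codes_b):
    """Return the set of codes the two (primary, secondary) pairs share."""
    return set(codes_a) & set(codes_b)


common = shared_codes(EXAMPLE_CODES["Smith"], EXAMPLE_CODES["Schmidt"])
print(common)          # {'XMT'}
print(bool(common))    # True: the names are candidate matches
```

A stricter policy could rank a primary-to-primary match above a match that involves a secondary code; that ranking is a design choice and is not something the text prescribes.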

Double Metaphone tries to account for myriad irregularities in English of Slavic, Germanic, Celtic, Greek, French, Italian, Spanish, Chinese, and other origin. Thus it uses a much more complex ruleset for coding than its predecessor; for example, it tests for approximately 100 different contexts of the use of the letter C alone.

Metaphone 3

Developed by the same author, this algorithm aims at further improving the accuracy of phonetic encoding of words in the English language. The ability to encode Metaphone keys taking non-initial vowels into account, as well as encoding voiced and unvoiced consonants differently, has been added. This allows the result set to be more closely focused if desired. Development for other language versions has been announced. Metaphone 3 is sold as source code in C++, Java and C# for 40 USD each.

See also

Implementations

Metaphone Implementations

Double Metaphone Implementations

References