Boyer–Moore string-search algorithm: Difference between revisions

Content deleted Content added

Inline

Revision as of 17:11, 23 August 2010

The Boyer–Moore string search algorithm is a particularly efficient string searching algorithm, and it has been the standard benchmark for the practical string search literature.^[1] It was developed by Bob Boyer and J Strother Moore in 1977. The algorithm preprocesses the target string (key) that is being searched for, but not the string being searched in (unlike some algorithms that preprocess the string to be searched and can then amortize the expense of the preprocessing by searching repeatedly). The execution time of the Boyer-Moore algorithm, while still linear in the size of the string being searched, can have a significantly lower constant factor than many other search algorithms: it doesn't need to check every character of the string to be searched, but rather skips over some of them. Generally the algorithm gets faster as the key being searched for becomes longer. Its efficiency derives from the fact that with each unsuccessful attempt to find a match between the search string and the text it's searching, it uses the information gained from that attempt to rule out as many positions of the text as possible where the string cannot match.

How the algorithm works

-

X

-

A

N

P

A

N

M

A

N

-

A

N

P

A

N

M

A

N

-

A

N

P

A

N

M

A

N

-

A

N

P

A

N

M

A

N

-

A

N

P

A

N

M

A

N

-

A

N

P

A

N

M

A

N

-

A

N

P

A

N

M

A

N

-

A

N

P

A

N

M

A

N

The X in position 8 excludes all 8 of the possible starting positions shown.

What people frequently find surprising about the Boyer-Moore algorithm, when they first encounter it, is that its verifications—its attempts to check whether a match exists at a particular position—work backwards. If it starts a search at the beginning of a text for the word "ANPANMAN", for instance, it checks the eighth position of the text to see if it contains an "N". If it finds the "N", it moves to the seventh position to see if that contains the last "A" of the word, and so on until it checks the first position of the text for a "A".

Why Boyer-Moore takes this backward approach is clearer when we consider what happens if the verification fails—for instance, if instead of an "N" in the eighth position, we find an "X". The "X" doesn't appear anywhere in "ANPANMAN", and this means there is no match for the search string at the very start of the text—or at the next seven positions following it, since those would all fall across the "X" as well. After checking the eight characters of the word "ANPANMAN" for just one character "X", we're able to skip ahead and start looking for a match ending at the sixteenth position of the text.

This explains why the average-case performance of the algorithm, for a text of length N and a fixed pattern of length M, is N/M: in the best case, only one in M characters needs to be checked. This also explains the somewhat counter-intuitive result that the longer the pattern we are looking for, the faster the algorithm will usually be able to find it.

The algorithm precomputes two tables to process the information it obtains in each failed verification: one table calculates how many positions ahead to start the next search based on the identity of the character that caused the match attempt to fail; the other makes a similar calculation based on how many characters were matched successfully before the match attempt failed. (Because these two tables return results indicating how far ahead in the text to "jump", they are sometimes called "jump tables", which should not be confused with the more common meaning of jump tables in computer science.) The algorithm will shift the larger of the two jump values when a mismatch occurs.

The first table

-

A

M

A

N

-

A

N

P

A

N

M

A

N

-

A

N

P

A

N

M

A

N

-

A

N

P

A

N

M

A

N

-

A

N

P

A

N

M

A

N

-

A

N

P

A

N

M

A

N

-

A

N

P

A

N

M

A

N

-

A

N

P

A

N

M

A

N

-

The mismatch "A" in position 5 (3 back from the last letter of the needle) excludes the first 6 of the possible starting positions shown.

The first table: for each value of i less than the length of the search string, we must first calculate the pattern consisting of the last i characters of the search string, preceded by a mis-match for the character before it; then we initially line it up with the search pattern and determine the least number of characters the partial pattern must be shifted left before the two patterns match. For instance, for the search string ANPANMAN, the table would be as follows: (N signifies any character that is not N)

i	Pattern	Shift
0	N	1
1	AN	8
2	MAN	3
3	NMAN	6
4	ANMAN	6
5	PANMAN	6
6	NPANMAN	6
7	ANPANMAN	6

The amount of shift calculated by the first table is sometimes called the "good suffix shift"^[2] or "(strong) good suffix rule". The original published Boyer-Moore algorithm^[3] uses a simpler, weaker, version of the good suffix rule in which each entry in the above table did not require a mis-match for the left-most character. This is sometimes called the "weak good suffix rule" and is not sufficient for proving that Boyer-Moore runs in linear worst-case time.

The second table

The second table is easy to calculate: Start at the last character of the sought string and move towards the first character. Each time you move left, if the character you are on is not in the table already, add it; its Shift value is its distance from the rightmost character. All other characters receive a count equal to the length of the search string.

Example: For the string ANPANMAN, the first table would be as shown (for clarity, entries are shown in the order they would be added to the table): (The N which is supposed to be zero is based on the 2nd N from the right because we only calculate from letters m-1)

Character	Shift
A	1
M	2
N	3
P	5
all other characters	8

The amount of shift calculated by the first table is sometimes called the "bad character shift".^[2]

Performance of the Boyer-Moore string search algorithm

The worst-case to find all occurrences in a text needs approximately 3*N comparisons, hence the complexity is O(n), regardless whether the text contains a match or not.^[4] This proof took some years to determine. In the year the algorithm was devised, 1977, the maximum number of comparisons was shown to be no more than 6*N; in 1980 it was shown to be no more than 4*N, until Cole's result in Sep 1991.

Implementation

C

#include <limits.h>
#include <string.h>

#define ALPHABET_SIZE (1 << CHAR_BIT)

static void compute_prefix(const char* str, size_t size, int result[size]) {
	size_t q;
	int k;
	result[0] = 0;

	k = 0;
	for (q = 1; q < size; q++) {
		while (k > 0 && str[k] != str[q])
			k = result[k-1];

		if (str[k] == str[q])
			k++;
		result[q] = k;
	}
}

static void prepare_badcharacter_heuristic(const char *str, size_t size,
		int result[ALPHABET_SIZE]) {

	size_t i;

	for (i = 0; i < ALPHABET_SIZE; i++)
		result[i] = -1;

	for (i = 0; i < size; i++)
		result[(size_t) str[i]] = i;
}

void prepare_goodsuffix_heuristic(const char *normal, size_t size,
		int result[size + 1]) {

	char *left = (char *) normal;
	char *right = left + size;
	char reversed[size+1];
	char *tmp = reversed + size;
	size_t i;

	/* reverse string */
	*tmp = 0;
	while (left < right)
		*(--tmp) = *(left++);

	int prefix_normal[size];
	int prefix_reversed[size];

	compute_prefix(normal, size, prefix_normal);
	compute_prefix(reversed, size, prefix_reversed);

	for (i = 0; i <= size; i++) {
		result[i] = size - prefix_normal[size-1];
	}

	for (i = 0; i < size; i++) {
		const int j = size - prefix_reversed[i];
		const int k = i - prefix_reversed[i]+1;

		if (result[j] > k)
			result[j] = k;
	}
}
/*
 * Boyer-Moore search algorithm
 */
const char *boyermoore_search(const char *haystack, const char *needle) {
	/*
	 * Calc string sizes
	 */
	size_t needle_len, haystack_len;
	needle_len = strlen(needle);
	haystack_len = strlen(haystack);

	/*
	 * Simple checks
	 */
	if(haystack_len == 0)
		return NULL;
	if(needle_len == 0)
		return haystack;

	/*
	 * Initialize heuristics
	 */
	int badcharacter[ALPHABET_SIZE];
	int goodsuffix[needle_len+1];

	prepare_badcharacter_heuristic(needle, needle_len, badcharacter);
	prepare_goodsuffix_heuristic(needle, needle_len, goodsuffix);

	/*
	 * Boyer-Moore search
	 */
	size_t s = 0;
	while(s <= (haystack_len - needle_len))
	{
		size_t j = needle_len;
		while(j > 0 && needle[j-1] == haystack[s+j-1])
			j--;

		if(j > 0)
		{
			int k = badcharacter[(size_t) haystack[s+j-1]];
			int m;
			if(k < (int)j && (m = j-k-1) > goodsuffix[j])
				s+= m;
			else
				s+= goodsuffix[j];
		}
		else
		{
			return haystack + s;
		}
	}

	/* not found */
	return NULL;
}

Variants

The Turbo Boyer-Moore algorithm takes an additional constant amount of space to complete a search within 2n comparisons (as opposed to 3n for Boyer-Moore), where n is the number of characters in the text to be searched.^[5]

The Boyer-Moore-Horspool algorithm is a simplification of the Boyer-Moore algorithm that leaves out the "first table". The Boyer-Moore-Horspool algorithm requires (in the worst case) mn comparisons, while the Boyer-Moore algorithm requires (in the worst case) only 3n comparisons.

References

^ Hume and Sunday (1991) [Fast String Searching] SOFTWARE—PRACTICE AND EXPERIENCE, VOL. 21(11), 1221–1248 (NOVEMBER 1991)
^ ^a ^b http://www.movsd.com/bm.htm
^ R. S. Boyer (1977). "A fast string searching algorithm". Comm. ACM. 20: 762–772. doi:10.1145/359842.359859. {{cite journal}}: Unknown parameter |coauthors= ignored (|author= suggested) (help)
^ Richard Cole (1991). "Tight bounds on the complexity of the Boyer-Moore algorithm". Proceedings of the 2nd Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 224–233. {{cite conference}}: Unknown parameter |booktitle= ignored (|book-title= suggested) (help)
^ http://www-igm.univ-mlv.fr/~lecroq/string/node15.html

External links

Original article
An example of the Boyer-Moore algorithm from the homepage of J Strother Moore, co-inventor of the algorithm
An explanation of the algorithm (with sample C code)
Cole et al., Tighter lower bounds on the exact complexity of string matching

[1] Hume and Sunday (1991) [Fast String Searching] SOFTWARE—PRACTICE AND EXPERIENCE, VOL. 21(11), 1221–1248 (NOVEMBER 1991)

[movsd-2] ttp://www.movsd.com/bm.htm

[3] R. S. Boyer (1977). "A fast string searching algorithm". Comm. ACM. 20: 762–772. doi:10.1145/359842.359859. {{cite journal}}: Unknown parameter |coauthors= ignored (|author= suggested) (help)

[4] Richard Cole (1991). "Tight bounds on the complexity of the Boyer-Moore algorithm". Proceedings of the 2nd Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 224–233. {{cite conference}}: Unknown parameter |booktitle= ignored (|book-title= suggested) (help)

[5] ttp://www-igm.univ-mlv.fr/~lecroq/string/node15.html

[1]

[2]

[3]

[4]

[5]

@@ Line 2: / Line 2: @@
 {{Cleanup|date=May 2007}}
-The '''Boyer–Moore string search algorithm''' is a particularly efficient [[string searching algorithm]], and it has been the standard benchmark for the practical string search literature.<ref>Hume and Sunday (1991) ''[Fast String Searching]'' SOFTWARE—PRACTICE AND EXPERIENCE, VOL. 21(11), 1221–1248 (NOVEMBER 1991)</ref> It was developed by [[Bob Boyer]] and [[J Strother Moore]] in 1977.  The [[algorithm]] [[preprocessor|preprocesses]] the target [[string (computer science)|string]] (key) that is being searched ''for'', but not the string being searched ''in'' (unlike some algorithms that preprocess the string to be searched and can then [[amortization|amortize]] the expense of the preprocessing by searching repeatedly).  The execution time of the Boyer-Moore algorithm can be sub-linear: it doesn't need to check every character of the string to be searched, but rather skips over some of them.  Generally the algorithm gets faster as the key being searched for becomes longer.  Its efficiency derives from the fact that with each unsuccessful attempt to find a match between the search string and the text it's searching, it uses the information gained from that attempt to rule out as many positions of the text as possible where the string cannot match.
+The '''Boyer–Moore string search algorithm''' is a particularly efficient [[string searching algorithm]], and it has been the standard benchmark for the practical string search literature.<ref>Hume and Sunday (1991) ''[Fast String Searching]'' SOFTWARE—PRACTICE AND EXPERIENCE, VOL. 21(11), 1221–1248 (NOVEMBER 1991)</ref> It was developed by [[Bob Boyer]] and [[J Strother Moore]] in 1977.  The [[algorithm]] [[preprocessor|preprocesses]] the target [[string (computer science)|string]] (key) that is being searched ''for'', but not the string being searched ''in'' (unlike some algorithms that preprocess the string to be searched and can then [[amortization|amortize]] the expense of the preprocessing by searching repeatedly).  The execution time of the Boyer-Moore algorithm, while still linear in the size of the string being searched, can have a significantly lower constant factor than many other search algorithms: it doesn't need to check every character of the string to be searched, but rather skips over some of them.  Generally the algorithm gets faster as the key being searched for becomes longer.  Its efficiency derives from the fact that with each unsuccessful attempt to find a match between the search string and the text it's searching, it uses the information gained from that attempt to rule out as many positions of the text as possible where the string cannot match.
 == How the algorithm works ==