Jump to content

Good–Turing frequency estimation: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
Historical background: Specified as it was linking to William G. Gale the economist
Line 15: Line 15:
Let us define some data structures and notation.
Let us define some data structures and notation.


Assume we have observed X distinct species, numbered x = 1, ... , X
Assume we have observed ''X'' distinct species, numbered ''x'' = 1, ... , ''X''


The frequency vector <math>R_x</math> gives the number of objects we have observed for species x
The frequency vector <math>R_x</math> gives the number of objects we have observed for species ''x''.


The frequency of frequencies vector <math>N_r</math> shows how many times the frequency r occurs in the vector <math>R_x</math>. For example <math>N_1</math> is the number of species for which only 1 object was observed.
The frequency of frequencies vector <math>N_r</math> shows how many times the frequency ''r'' occurs in the vector <math>R_x</math>. For example <math>N_1</math> is the number of species for which only 1 object was observed.


Note that the total number of objects observed, N, can be found from <math>N = \sum r N_r</math>.
Note that the total number of objects observed, N, can be found from <math>N = \sum r N_r</math>.
Line 26: Line 26:
This estimate is <math>p_0 = N_1 / N </math>
This estimate is <math>p_0 = N_1 / N </math>


The next step is to find an estimate of probability for objects which were seen r times,
The next step is to find an estimate of probability for objects which were seen r times, this estimate is <math>p_r = \frac{(r+1) S(N_{r+1})}{N S(N_r)}</math>.
this estimate is <math>p_r = \frac{(r+1) S(N_{r+1})}{N S(N_r)}</math>


The notation <math>S( )</math> means the smoothed or adjusted value of the frequency shown in parenthesis. An overview of how to perform this smoothing will now be given.
The notation <math>S( )</math> means the smoothed or adjusted value of the frequency shown in parenthesis. An overview of how to perform this smoothing will now be given.


We would like to make a plot of <math>\log N_r</math> versus <math>\log r</math> but this is problematic because for
We would like to make a plot of <math>\log N_r</math> versus <math>\log r</math> but this is problematic because for
large r many <math>N_r</math> will be zero. Instead we plot <math>\log Z_r</math> versus <math>\log r</math> where ''Z'' is defined as
large ''r'' many <math>N_r</math> will be zero. Instead we plot <math>\log Z_r</math> versus <math>\log r</math> where ''Z'' is defined as


:<math>Z_r = \frac{N_r}{0.5(t-q)} </math> where q, r and t are consecutive subscripts having <math>N_q, N_r, N_t</math> non zero.
:<math>Z_r = \frac{N_r}{0.5(t-q)} </math>
where ''q'', ''r'' and ''t'' are consecutive subscripts having <math>N_q, N_r, N_t</math> non-zero.


A linear regression is then fit to the log-log plot. For small values of r we take <math>S(N_r) = N_r</math>
A linear regression is then fit to the log-log plot. For small values of r we take <math>S(N_r) = N_r</math>
(that is, no smoothing is performed), while for large values on r <math>S(N_r)</math> is read off the
(that is, no smoothing is performed), while for large values on ''r'' <math>S(N_r)</math> is read off the
regression line. An automatic procedure (not described here) specifies at what point the
regression line. An automatic procedure (not described here) specifies at what point the switch from no smoothing to linear smoothing should take place.
switch from no smoothing to linear smoothing should take place.


Code for the method is available in the public domain.
Code for the method is available in the public domain.

Revision as of 22:42, 4 May 2007

Good–Turing frequency estimation is a statistical technique for predicting the probability of occurrence of objects belonging to an unknown number of species, given past observations of such objects and their species. (In drawing balls from an urn, the 'objects' would be balls and the 'species' would be the distinct colors of the balls (finite but unknown in number). After drawing red balls, black balls and green balls, we would ask what is the probability of drawing a red ball, a black ball, a green ball or one of a previously unseen color.)

Historical background

GTFE was developed by Alan Turing and his assistant I.J. Good during World War II. (It was part of their efforts at Bletchley Park to crack German ciphers for the Enigma machine during the war). Turing at first modeled the frequencies as a binomial distribution, but found it inaccurate. Good developed smoothing algorithms to improve the estimator's accuracy.

The discovery was recognized as significant when published by Good in 1953, but the calculations were difficult so it was not used as widely as it might have been.[1]

The method even gained some literary fame due to the Robert Harris novel Enigma.

In the 1990s, Geoffrey Sampson worked with William A. Gale of AT&T, to create and implement a simplified and easier to use variant of the Good-Turing method [2] described below.

The method

Let us define some data structures and notation.

Assume we have observed X distinct species, numbered x = 1, ... , X

The frequency vector gives the number of objects we have observed for species x.

The frequency of frequencies vector shows how many times the frequency r occurs in the vector . For example is the number of species for which only 1 object was observed.

Note that the total number of objects observed, N, can be found from .

The first step in the calculation is to find an estimate of the total probability of unseen objects. This estimate is

The next step is to find an estimate of probability for objects which were seen r times, this estimate is .

The notation means the smoothed or adjusted value of the frequency shown in parenthesis. An overview of how to perform this smoothing will now be given.

We would like to make a plot of versus but this is problematic because for large r many will be zero. Instead we plot versus where Z is defined as

where q, r and t are consecutive subscripts having non-zero.

A linear regression is then fit to the log-log plot. For small values of r we take (that is, no smoothing is performed), while for large values on r is read off the regression line. An automatic procedure (not described here) specifies at what point the switch from no smoothing to linear smoothing should take place.

Code for the method is available in the public domain.

References

See also