Jump to content

Good–Turing frequency estimation: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
Citation bot (talk | contribs)
Add: s2cid, hdl, authors 1-1. Removed parameters. Some additions/deletions were parameter name changes. | Use this bot. Report bugs. | Suggested by Abductive | #UCB_toolbar
Calculation: If it’s N_0, p_0 is always 0, and we gain no information about the likelihood of an unseen species, so it shouldn’t be N_0
Tags: Manual revert Mobile edit Mobile web edit
 
(8 intermediate revisions by 7 users not shown)
Line 1: Line 1:
{{EngvarB|date=August 2024}}
{{Short description|Statistical technique invented by developed by Alan Turing and his assistant I. J. Good}}
{{Short description|Statistical technique invented by developed by Alan Turing and his assistant I. J. Good}}
'''Good–Turing frequency estimation''' is a [[statistical]] technique for estimating the [[probability]] of encountering an object of a hitherto unseen species, given a set of past observations of objects from different species. In drawing balls from an urn, the 'objects' would be balls and the 'species' would be the distinct colors of the balls (finite but unknown in number). After drawing <math>R_\text{red}</math> red balls, <math>R_\text{black}</math> black balls and <math>R_\text{green}</math> green balls, we would ask what is the probability of drawing a red ball, a black ball, a green ball or one of a previously unseen color.
'''Good–Turing frequency estimation''' is a [[statistical]] technique for estimating the [[probability]] of encountering an object of a hitherto unseen species, given a set of past observations of objects from different species. In drawing balls from an urn, the 'objects' would be balls and the 'species' would be the distinct colours of the balls (finite but unknown in number). After drawing <math>R_\text{red}</math> red balls, <math>R_\text{black}</math> black balls and <math>R_\text{green}</math> green balls, we would ask what is the probability of drawing a red ball, a black ball, a green ball or one of a previously unseen colour.


==Historical background==
==Historical background==
Good–Turing frequency estimation was developed by [[Alan Turing]] and his assistant [[I. J. Good]] as part of their methods used at [[Bletchley Park]] for cracking [[Germany|German]] [[cipher]]s for the [[Enigma machine]] during [[World War II]]. Turing at first modelled the frequencies as a [[multinomial distribution]], but found it inaccurate. Good developed [[smoothing]] algorithms to improve the estimator's accuracy.
Good–Turing frequency estimation was developed by [[Alan Turing]] and his assistant [[I. J. Good]] as part of their methods used at [[Bletchley Park]] for cracking [[Germany|German]] [[cipher]]s for the [[Enigma machine]] during [[World War II]]. Turing at first modelled the frequencies as a [[multinomial distribution]], but found it inaccurate. Good developed [[smoothing]] algorithms to improve the estimator's accuracy.


The discovery was recognized as significant when published by Good in 1953,<ref name="good">{{cite journal
The discovery was recognised as significant when published by Good in 1953,<ref name="good">{{cite journal
|last=Good |first=I.J. |author-link=I.J. Good
|last=Good |first=I.J. |author-link=I.J. Good
|year=1953
|year=1953
Line 15: Line 16:
</ref> The method even gained some literary fame due to the [[Robert Harris (novelist)|Robert Harris]] novel ''[[Enigma (novel)|Enigma]]''.
</ref> The method even gained some literary fame due to the [[Robert Harris (novelist)|Robert Harris]] novel ''[[Enigma (novel)|Enigma]]''.


In the 1990s, [[Geoffrey Sampson]] worked with [[William A. Gale]] of [[AT&T]] to create and implement a simplified and easier-to-use variant of the Good–Turing method<ref>{{Cite journal|last1=Gale|first1=William A.|last2=Sampson|first2=Geoffrey|date=1995|title=Good‐turing frequency estimation without tears|url=https://www.grsampson.net/AGtf1.html|journal=Journal of Quantitative Linguistics|volume=2|issue=3|pages=217–237|doi=10.1080/09296179508590051|issn=0929-6174}}</ref><ref>{{cite journal |last1=Orlitsky |first1=Alon |last2=Suresh |first2=Ananda |date=2015 |title=Competitive distribution estimation: Why is Good-Turing Good? |url=https://papers.nips.cc/paper/5762-competitive-distribution-estimation-why-is-good-turing-good.pdf |journal=Neural Information Processing Systems (NIPS) |pages=1–9 |access-date=28 March 2016}}</ref> described below. Various heuristic justifications<ref>{{cite journal |last=Nadas |first=A. |title=Good, Jelinek, Mercer, and Robbins on Turing's Estimate of Probabilities |journal=American Journal of Mathematical and Management Sciences |volume=11 |issue=3–4 |pages=299–308 |date=1991 |publisher=American Sciences Press Syracuse, NY, USA |doi=10.1080/01966324.1991.10737313 }}</ref> and a simple combinatorial derivation have been provided.<ref name="hutter">{{cite conference |last=Hutter |first=Marcus |author-link=Marcus Hutter |title=Offline to Online Conversion |date=2014 |series=LNAI |volume=8776 |publisher=Springer |book-title=Proc. 25th International Conf. on Algorithmic Learning Theory (ALT'14) |pages=230–244 |location=Bled, Slovenia|arxiv=1407.3334 |doi=10.1007/978-3-319-11662-4_17 }}</ref>
In the 1990s, [[Geoffrey Sampson]] worked with [[William A. Gale]] of [[AT&T]] to create and implement a simplified and easier-to-use variant of the Good–Turing method<ref>{{Cite journal|last1=Gale|first1=William A.|last2=Sampson|first2=Geoffrey|date=1995|title=Good-turing frequency estimation without tears|url=https://www.grsampson.net/AGtf1.html|journal=Journal of Quantitative Linguistics|volume=2|issue=3|pages=217–237|doi=10.1080/09296179508590051|issn=0929-6174}}</ref><ref>{{cite journal |last1=Orlitsky |first1=Alon |last2=Suresh |first2=Ananda |date=2015 |title=Competitive distribution estimation: Why is Good-Turing Good? |url=https://papers.nips.cc/paper/5762-competitive-distribution-estimation-why-is-good-turing-good.pdf |journal=Neural Information Processing Systems (NIPS) |pages=1–9 |access-date=28 March 2016}}</ref> described below. Various heuristic justifications<ref>{{cite journal |last=Nadas |first=A. |title=Good, Jelinek, Mercer, and Robbins on Turing's Estimate of Probabilities |journal=American Journal of Mathematical and Management Sciences |volume=11 |issue=3–4 |pages=299–308 |date=1991 |publisher=American Sciences Press Syracuse, NY, USA |doi=10.1080/01966324.1991.10737313 }}</ref> and a simple combinatorial derivation have been provided.<ref name="hutter">{{cite conference |last=Hutter |first=Marcus |author-link=Marcus Hutter |title=Offline to Online Conversion |date=2014 |series=LNAI |volume=8776 |publisher=Springer |book-title=Proc. 25th International Conf. on Algorithmic Learning Theory (ALT'14) |pages=230–244 |location=Bled, Slovenia|arxiv=1407.3334 |doi=10.1007/978-3-319-11662-4_17 }}</ref>


==The method==
==The method==
The assumption of Good–Turing estimation is that the number of occurrence for each species follows a binomial distribution.<ref name="cornell">{{cite web |title=The Good–Turing estimate |series=CS&nbsp;6740 |type=course guide |volume=Lecture&nbsp;11 |department=Computer Science |publisher=[[Cornell University]] |place=Ithaca, NY |year=2010 |url=http://www.cs.cornell.edu/courses/cs6740/2010sp/guides/lec11.pdf}}</ref>
The Good–Turing estimator is largely independent of the distribution of species frequencies.<ref name="Good_1953">{{cite journal |last1=GOOD |first1=I. J. |title=The Population Frequencies of Species and the Estimation of Population Parameters |journal=Biometrika |date=1953 |volume=40 |issue=3–4 |pages=237–264 |doi=10.1093/biomet/40.3-4.237 |url=https://academic.oup.com/biomet/article-abstract/40/3-4/237/195998 |access-date=14 September 2022}}</ref>


===Notation===
===Notation===
Line 48: Line 49:
The next step is to estimate the probability that the next observed individual is from a species which has been seen <math>r</math> times. For a ''single'' species this estimate is:
The next step is to estimate the probability that the next observed individual is from a species which has been seen <math>r</math> times. For a ''single'' species this estimate is:
:<math> p_r = \frac{\; (r+1) \; S(N_{r+1}) \;}{N\,S(N_r)} ~.</math>
:<math> p_r = \frac{\; (r+1) \; S(N_{r+1}) \;}{N\,S(N_r)} ~.</math>
Here, the notation <math>S(\cdot)</math> means the ''smoothed'', or ''adjusted'' value of the frequency shown in parenthesis. An overview of how to perform this smoothing [[#smoothing_section_anchor|follows in the next section]] (see also [[empirical Bayes method]]).
Here, the notation <math>S(\cdot)</math> means the ''smoothed'', or ''adjusted'' value of the frequency shown in parentheses. An overview of how to perform this smoothing [[#smoothing_section_anchor|follows in the next section]] (see also [[empirical Bayes method]]).


To estimate the probability that the next observed individual is from any species from this group (i.e., the group of species seen <math>\, r \,</math> times) one can use the following formula:
To estimate the probability that the next observed individual is from any species from this group (i.e., the group of species seen <math>\, r \,</math> times) one can use the following formula:
Line 58: Line 59:
:<math>Z_r = \frac{N_r}{\; \tfrac{1}{2}\,(t - q) \;} ~,</math>
:<math>Z_r = \frac{N_r}{\; \tfrac{1}{2}\,(t - q) \;} ~,</math>


and where {{mvar|q}}, {{mvar|r}}, and {{mvar|t}} are three consecutive subscripts with non-zero counts <math>\; N_q, N_r, N_t \;</math>. For the special case when {{mvar|r}} is 1, take {{mvar|q}} to be 0. In the opposite special case, when <math>\, r = r_\mathsf{last} \;</math> is the index of the ''last'' non-zero count, replace the divisor <math>\, \tfrac{1}{2}\,(t - q) \,</math> with <math>\, r_\mathsf{last} - q \;,</math> so <math>\; Z_{r_\mathsf{last}} = \frac{N_{r_\mathsf{last}}}{\; (r_\mathsf{last} - q) \;} ~.</math>
and where {{mvar|q}}, {{mvar|r}}, and {{mvar|t}} are three consecutive subscripts with non-zero counts <math>\; N_q, N_r, N_t \;</math>. For the special case when {{mvar|r}} is 1, take {{mvar|q}} to be 0. In the opposite special case, when <math>\, r = r_\mathsf{last} \;</math> is the index of the ''last'' non-zero count, replace the divisor <math>\, \tfrac{1}{2}\,(t - q) \,</math> with <math>\, r_\mathsf{last} - q \;,</math> so <math>\; Z_{r_\mathsf{last}} = \frac{N_{r_\mathsf{last}}}{\; r_\mathsf{last} - q \;} ~.</math>


A [[simple linear regression]] is then fitted to the log–log plot.
A [[simple linear regression]] is then fitted to the log–log plot.
Line 69: Line 70:
==Derivation==
==Derivation==


Many different derivations of the above formula for <math>p_r</math> have been given.<ref name="good"/><ref name="hutter"/><ref name="cornell"/><ref>{{cite journal |last1=Favaro |first1=Stefano |last2=Nipoti |first2=Bernardo |last3=Teh |first3=Yee Whye |title=Rediscovery of Good-Turing estimators via Bayesian nonparametrics |journal=Biometrics |volume=72 |number=1 |pages=136–145 |date=2016 |publisher=Wiley Online Library |doi=10.1111/biom.12366 |pmid=26224325 |arxiv=1401.0303 |hdl=2318/1591184 |s2cid=5704019 |url= https://doi.org/10.1111/biom.12366 }}</ref>
Many different derivations of the above formula for <math>p_r</math> have been given.<ref name="good"/><ref name="hutter"/><ref name="cornell">{{cite web |title=The Good–Turing estimate |series=CS&nbsp;6740 |type=course guide |volume=Lecture&nbsp;11 |department=Computer Science |publisher=[[Cornell University]] |place=Ithaca, NY |year=2010 |url=http://www.cs.cornell.edu/courses/cs6740/2010sp/guides/lec11.pdf}}</ref><ref>{{cite journal |last1=Favaro |first1=Stefano |last2=Nipoti |first2=Bernardo |last3=Teh |first3=Yee Whye |title=Rediscovery of Good-Turing estimators via Bayesian nonparametrics |journal=Biometrics |volume=72 |number=1 |pages=136–145 |date=2016 |publisher=Wiley Online Library |doi=10.1111/biom.12366 |pmid=26224325 |arxiv=1401.0303 |hdl=2318/1591184 |s2cid=5704019 |url= https://doi.org/10.1111/biom.12366 }}</ref>


One of the simplest ways to motivate the formula is by assuming the next item will behave similarly to the previous item. The overall idea of the estimator is that currently we are seeing never-seen items at a certain frequency, seen-once items at a certain frequency, seen-twice items at a certain frequency, and so on. Our goal is to estimate just how likely each of these categories is, for the ''next'' item we will see. Put another way, we want to know the current rate at which seen-twice items are becoming seen-thrice items, and so on. Since we don't assume anything about the underlying probability distribution, it does sound a bit mysterious at first. But it is extremely easy to calculate these probabilities ''empirically'' for the ''previous'' item we saw, even assuming we don't remember exactly which item that was: Take all the items we have seen so far (including multiplicities) — the last item we saw was a random one of these, all equally likely. Specifically, the chance that we saw an item for the <math>(r+1)</math>th time is simply the chance that it was one of the items that we have now seen <math>r + 1</math> times, namely <math>\frac{(r+1) N_{r+1}}{N}</math>. In other words, our chance of seeing an item that had been seen ''r'' times before was <math>\frac{(r+1) N_{r+1}}{N}</math>. So now we simply assume that this chance will be about the same for the next item we see. This immediately gives us the formula above for <math>p_0</math>, by setting <math>r=0</math>. And for <math>r>0</math>, to get the probability that ''a particular one'' of the <math>N_{r}</math> items is going to be the next one seen, we need to divide this probability (of seeing ''some'' item that has been seen ''r'' times) among the <math>N_{r}</math> possibilities for which particular item that could be. This gives us the formula <math>\;p_{r} = \frac{(r+1) N_{r+1}}{N N_r}</math>. Of course, your actual data will probably be a bit noisy, so you will want to smooth the values first to get a better estimate of how quickly the category counts are growing, and this gives the formula as shown above. This approach is in the same spirit as deriving the standard [[Bernoulli distribution|Bernoulli]] [[estimator]] by simply asking what the two probabilities were for the previous coin flip (after scrambling the trials seen so far), given only the current result counts, while assuming nothing about the underlying distribution.
One of the simplest ways to motivate the formula is by assuming the next item will behave similarly to the previous item. The overall idea of the estimator is that currently we are seeing never-seen items at a certain frequency, seen-once items at a certain frequency, seen-twice items at a certain frequency, and so on. Our goal is to estimate just how likely each of these categories is, for the ''next'' item we will see. Put another way, we want to know the current rate at which seen-twice items are becoming seen-thrice items, and so on. Since we don't assume anything about the underlying probability distribution, it does sound a bit mysterious at first. But it is extremely easy to calculate these probabilities ''empirically'' for the ''previous'' item we saw, even assuming we don't remember exactly which item that was: Take all the items we have seen so far (including multiplicities) — the last item we saw was a random one of these, all equally likely. Specifically, the chance that we saw an item for the <math>(r+1)</math>th time is simply the chance that it was one of the items that we have now seen <math>r + 1</math> times, namely <math>\frac{(r+1) N_{r+1}}{N}</math>. In other words, our chance of seeing an item that had been seen ''r'' times before was <math>\frac{(r+1) N_{r+1}}{N}</math>. So now we simply assume that this chance will be about the same for the next item we see. This immediately gives us the formula above for <math>p_0</math>, by setting <math>r=0</math>. And for <math>r>0</math>, to get the probability that ''a particular one'' of the <math>N_{r}</math> items is going to be the next one seen, we need to divide this probability (of seeing ''some'' item that has been seen ''r'' times) among the <math>N_{r}</math> possibilities for which particular item that could be. This gives us the formula <math>\;p_{r} = \frac{(r+1) N_{r+1}}{N N_r}</math>. Of course, your actual data will probably be a bit noisy, so you will want to smooth the values first to get a better estimate of how quickly the category counts are growing, and this gives the formula as shown above. This approach is in the same spirit as deriving the standard [[Bernoulli distribution|Bernoulli]] [[estimator]] by simply asking what the two probabilities were for the previous coin flip (after scrambling the trials seen so far), given only the current result counts, while assuming nothing about the underlying distribution.

Latest revision as of 16:08, 17 September 2024

Good–Turing frequency estimation is a statistical technique for estimating the probability of encountering an object of a hitherto unseen species, given a set of past observations of objects from different species. In drawing balls from an urn, the 'objects' would be balls and the 'species' would be the distinct colours of the balls (finite but unknown in number). After drawing red balls, black balls and green balls, we would ask what is the probability of drawing a red ball, a black ball, a green ball or one of a previously unseen colour.

Historical background

[edit]

Good–Turing frequency estimation was developed by Alan Turing and his assistant I. J. Good as part of their methods used at Bletchley Park for cracking German ciphers for the Enigma machine during World War II. Turing at first modelled the frequencies as a multinomial distribution, but found it inaccurate. Good developed smoothing algorithms to improve the estimator's accuracy.

The discovery was recognised as significant when published by Good in 1953,[1] but the calculations were difficult so it was not used as widely as it might have been.[2] The method even gained some literary fame due to the Robert Harris novel Enigma.

In the 1990s, Geoffrey Sampson worked with William A. Gale of AT&T to create and implement a simplified and easier-to-use variant of the Good–Turing method[3][4] described below. Various heuristic justifications[5] and a simple combinatorial derivation have been provided.[6]

The method

[edit]

The Good–Turing estimator is largely independent of the distribution of species frequencies.[7]

Notation

[edit]
  • Assuming that distinct species have been observed, enumerated
  • Then the frequency vector, has elements that give the number of individuals that have been observed for species
  • The frequency of frequencies vector, shows how many times the frequency r occurs in the vector (i.e. among the elements ):

For example, is the number of species for which only one individual was observed. Note that the total number of objects observed, can be found from

Calculation

[edit]

The first step in the calculation is to estimate the probability that a future observed individual (or the next observed individual) is a member of a thus far unseen species. This estimate is:[8]

The next step is to estimate the probability that the next observed individual is from a species which has been seen times. For a single species this estimate is:

Here, the notation means the smoothed, or adjusted value of the frequency shown in parentheses. An overview of how to perform this smoothing follows in the next section (see also empirical Bayes method).

To estimate the probability that the next observed individual is from any species from this group (i.e., the group of species seen times) one can use the following formula:

Smoothing

[edit]

For smoothing the erratic values in for large r, we would like to make a plot of versus but this is problematic because for large many will be zero. Instead a revised quantity, is plotted versus where is defined as

and where q, r, and t are three consecutive subscripts with non-zero counts . For the special case when r is 1, take q to be 0. In the opposite special case, when is the index of the last non-zero count, replace the divisor with so

A simple linear regression is then fitted to the log–log plot.

For small values of it is reasonable to set – that is, no smoothing is performed.

For large values of r, values of are read off the regression line. An automatic procedure (not described here) can be used to specify at what point the switch from no smoothing to linear smoothing should take place.[9][full citation needed] Code for the method is available in the public domain.[10]

Derivation

[edit]

Many different derivations of the above formula for have been given.[1][6][11][12]

One of the simplest ways to motivate the formula is by assuming the next item will behave similarly to the previous item. The overall idea of the estimator is that currently we are seeing never-seen items at a certain frequency, seen-once items at a certain frequency, seen-twice items at a certain frequency, and so on. Our goal is to estimate just how likely each of these categories is, for the next item we will see. Put another way, we want to know the current rate at which seen-twice items are becoming seen-thrice items, and so on. Since we don't assume anything about the underlying probability distribution, it does sound a bit mysterious at first. But it is extremely easy to calculate these probabilities empirically for the previous item we saw, even assuming we don't remember exactly which item that was: Take all the items we have seen so far (including multiplicities) — the last item we saw was a random one of these, all equally likely. Specifically, the chance that we saw an item for the th time is simply the chance that it was one of the items that we have now seen times, namely . In other words, our chance of seeing an item that had been seen r times before was . So now we simply assume that this chance will be about the same for the next item we see. This immediately gives us the formula above for , by setting . And for , to get the probability that a particular one of the items is going to be the next one seen, we need to divide this probability (of seeing some item that has been seen r times) among the possibilities for which particular item that could be. This gives us the formula . Of course, your actual data will probably be a bit noisy, so you will want to smooth the values first to get a better estimate of how quickly the category counts are growing, and this gives the formula as shown above. This approach is in the same spirit as deriving the standard Bernoulli estimator by simply asking what the two probabilities were for the previous coin flip (after scrambling the trials seen so far), given only the current result counts, while assuming nothing about the underlying distribution.

See also

[edit]

References

[edit]
  1. ^ a b Good, I.J. (1953). "The population frequencies of species and the estimation of population parameters". Biometrika. 40 (3–4): 237–264. doi:10.1093/biomet/40.3-4.237. JSTOR 2333344. MR 0061330.
  2. ^ Newsise: Scientists Explain and Improve Upon 'Enigmatic' Probability Formula, a popular review of Orlitsky A, Santhanam NP, Zhang J (2003). "Always Good Turing: asymptotically optimal probability estimation". Science. 302 (5644): 427–31. Bibcode:2003Sci...302..427O. doi:10.1126/science.1088284. PMID 14564004.
  3. ^ Gale, William A.; Sampson, Geoffrey (1995). "Good-turing frequency estimation without tears". Journal of Quantitative Linguistics. 2 (3): 217–237. doi:10.1080/09296179508590051. ISSN 0929-6174.
  4. ^ Orlitsky, Alon; Suresh, Ananda (2015). "Competitive distribution estimation: Why is Good-Turing Good?" (PDF). Neural Information Processing Systems (NIPS): 1–9. Retrieved 28 March 2016.
  5. ^ Nadas, A. (1991). "Good, Jelinek, Mercer, and Robbins on Turing's Estimate of Probabilities". American Journal of Mathematical and Management Sciences. 11 (3–4). American Sciences Press Syracuse, NY, USA: 299–308. doi:10.1080/01966324.1991.10737313.
  6. ^ a b Hutter, Marcus (2014). "Offline to Online Conversion". Proc. 25th International Conf. on Algorithmic Learning Theory (ALT'14). LNAI. Vol. 8776. Bled, Slovenia: Springer. pp. 230–244. arXiv:1407.3334. doi:10.1007/978-3-319-11662-4_17.
  7. ^ GOOD, I. J. (1953). "The Population Frequencies of Species and the Estimation of Population Parameters". Biometrika. 40 (3–4): 237–264. doi:10.1093/biomet/40.3-4.237. Retrieved 14 September 2022.
  8. ^ Gale, William A. (1995). "Good–Turing smoothing without tears". Journal of Quantitative Linguistics. 2 (3): 3. CiteSeerX 10.1.1.110.8518. doi:10.1080/09296179508590051.
  9. ^ Church, K.; Gale, W. (1991). "A comparison of the enhanced Good–Turing and deleted estimation methods for estimating probabilities of English bigrams". {{cite journal}}: Cite journal requires |journal= (help)
  10. ^ Sampson, Geoffrey (2005). "Simple Good–Turing frequency estimator". grsampson.net (source code in C).
  11. ^ "The Good–Turing estimate" (PDF). Computer Science (course guide). CS 6740. Ithaca, NY: Cornell University. 2010.
  12. ^ Favaro, Stefano; Nipoti, Bernardo; Teh, Yee Whye (2016). "Rediscovery of Good-Turing estimators via Bayesian nonparametrics". Biometrics. 72 (1). Wiley Online Library: 136–145. arXiv:1401.0303. doi:10.1111/biom.12366. hdl:2318/1591184. PMID 26224325. S2CID 5704019.

Bibliography

[edit]