Talk:Good–Turing frequency estimation: Difference between revisions

From Wikipedia, the free encyclopedia
Revision by Srchvrs (talk | contribs)
Edit summary: Adding class & importance to statistics rating template
Line 1:
- {{WPStatistics}}
+ {{WPStatistics|class=start|importance=low}}


This needs a description of the actual technique used by Good and Turing, which is notable by its absence from this article. -- [[User:The Anome|The Anome]] 00:25, 27 October 2006 (UTC)

Revision as of 20:26, 3 December 2013

WikiProject Statistics (Start-class, Low-importance)
This article is within the scope of WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Start: This article has been rated as Start-class on Wikipedia's content assessment scale.
Low: This article has been rated as Low-importance on the importance scale.

This needs a description of the actual technique used by Good and Turing, which is notable by its absence from this article. -- The Anome 00:25, 27 October 2006 (UTC)

You are right, and this has now been (partly) rectified. Encyclops 01:17, 11 February 2007 (UTC)

Good Turing Smoothing

Jurafsky and Martin's authoritative work "Speech and Language Processing" (Chapter 6, section 6.3) mentions a Good-Turing smoothing algorithm that seems relevant to this article but, as far as I can tell, doesn't entirely fit the formulas described here. Does anyone know whether and how it is related, and whether it should be incorporated here or in a separate article? Gijs Kruitbosch 00:41, 11 May 2007 (UTC)

Derivation of the estimates?

According to the article:

The first step in the calculation is to find an estimate of the total probability of unseen objects. This estimate is $p_0 = N_1 / N$.
The next step is to find an estimate of probability for objects which were seen r times, this estimate is $p_r = \frac{(r+1)\, S(N_{r+1})}{N\, S(N_r)}$.

Where do these estimates come from? What prior probabilities are being assumed? -- Jheald 10:13, 4 March 2007 (UTC)

Do we define p_r in the article? Depending on the definition, this formula is either correct or wrong. -- Preceding unsigned comment added by Srchvrs (talk | contribs) 03:41, 11 February 2013 (UTC)

It is not a Bayesian scheme that starts with prior probabilities. Why it works is somewhat of a mystery and perhaps still an open research problem (although Orlitsky, in the first cited reference [1], claims to have figured it out. I haven't read his work).
Using the number of species that have been seen only once as an estimator of the number of species that have not been seen yet seems reasonable, but I don't know a formal justification. If there are a lot of species that we have observed only once, then we must still be in the early stages of species discovery, so there are probably a lot of species out there that we have not seen yet.
The second formula is even less intuitive. It has to do with 'adjusting' the observed counts in a certain manner. Perhaps it would help to add some discussion, but I don't feel qualified to do that. Encyclops 16:34, 4 March 2007 (UTC)
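
A minimal sketch of the two quantities discussed above, assuming the usual unsmoothed Good–Turing form: the total probability of unseen species is estimated as N_1/N, and a raw count r is 'adjusted' to r* = (r+1) N_{r+1} / N_r, where N_r is the number of species seen exactly r times. The function name and the toy sample are hypothetical, chosen only for illustration, and this is not necessarily the exact procedure described in the article.

```python
from collections import Counter


def good_turing_sketch(observations):
    """Unsmoothed Good-Turing quantities for a list of observed species labels.

    Hypothetical helper for illustration only.
    Returns (p_unseen, adjusted), where adjusted maps r -> r*.
    """
    species_counts = Counter(observations)           # species -> r
    n_total = sum(species_counts.values())           # N, total observations
    freq_of_freq = Counter(species_counts.values())  # r -> N_r

    # Estimated total probability of all unseen species: N_1 / N.
    p_unseen = freq_of_freq.get(1, 0) / n_total

    # Adjusted count r* = (r + 1) * N_{r+1} / N_r for each observed r.
    adjusted = {}
    for r, n_r in sorted(freq_of_freq.items()):
        n_next = freq_of_freq.get(r + 1, 0)
        adjusted[r] = (r + 1) * n_next / n_r
    return p_unseen, adjusted


# Toy sample: a seen 3 times, b twice, c, d and e once each (N = 8).
p_unseen, adjusted = good_turing_sketch(list("aaabbcde"))
print(p_unseen)
print(adjusted)
```

On this toy sample the adjusted count for the largest r collapses to zero because no species was seen r + 1 times, which is why practical versions smooth the N_r values before applying the formula.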
The formulas and procedure don't precisely agree with my understanding of Turing's original work, which was based on a general theory of distributions of distributions of ... which so far as I know was never published and may still be classified.
I agree that it is not a Bayesian scheme as such.
A reasonable intuitive estimate for the likelihood of any single specified species which has not yet been seen (e.g. a purple ball) would seem to be , reasoning that in the absence of any other information, the current set of observations is on the border between having seen one of that species and having seen none of that species, and indeed this assumption has been heavily used by cryptanalysts when they know the full set of species in advance. (The interesting thing about Good–Turing is that the full set of species is not known in advance.) So the formula amounts to saying that the expected number of unseen species is , which is interesting in its own right (and still needs justification, which I'm not prepared to provide).
Somewhere at home I have a copy of the original Biometrika paper, which should be cited in the article as the initial publication concerning this topic. -- DAGwyn (talk) 21:41, 14 February 2008 (UTC)
In case it helps you find it, here is a reference that might be what you are looking for: I. J. Good: The population frequencies of species and the estimation of population parameters. Biometrika, 40:237–264, Dec 1953. Encyclops (talk) 23:12, 14 February 2008 (UTC)

The formal similarity between the formula being asked about here and that of Robbins' empirical Bayes methods is obvious. Perhaps that can answer the question. I'll look at this more closely and then opine further. Michael Hardy 22:56, 4 May 2007 (UTC)

Example plot request

The following is copied from the Stats project talk page. Melcombe (talk) 14:23, 2 December 2010 (UTC)

"Instead we plot [...]"

Shouldn't there be a plot? Please add some illustrations or adjust the text. Thank you! --Peni (talk) 14:01, 1 December 2010 (UTC)

P.S. Source: "Good-Turing smoothing without tears", William A. Gale, Journal of Quantitative Linguistics, 1995. --Peni (talk) 15:39, 1 December 2010 (UTC)
Be bold. --Qwfp (talk) 17:27, 1 December 2010 (UTC)

(end copy)

The Novel by Robert Harris

Has anyone read the novel by Robert Harris? Is it relevant to the topic of this article, or should the reference be removed? The comment "The book, though fiction, is criticised by people who were at Bletchley Park as bearing little resemblance to the real wartime Bletchley Park" makes me doubt that the author has any valid technical or historical points to offer on the subject at hand. Encyclops (talk) 16:54, 30 March 2011 (UTC)