Talk:Good–Turing frequency estimation
This needs a description of the actual technique used by Good and Turing, which is notable by its absence from this article. -- The Anome 00:25, 27 October 2006 (UTC)
- You are right, and this has now been (partly) rectified. Encyclops 01:17, 11 February 2007 (UTC)
Good Turing Smoothing
Jurafsky and Martin's authoritative work "Speech and Language Processing" (Chapter 6, section 6.3) mentions a Good-Turing Smoothing algorithm which seems to have some relevance to this article but doesn't entirely fit the formulas described here as far as I can tell. Does anyone know how or if it is related, and whether it should be incorporated here or in a separate article? Gijs Kruitbosch 00:41, 11 May 2007 (UTC)
Derivation of the estimates?
According to the article:
- The first step in the calculation is to find an estimate of the total probability of unseen objects. This estimate is p_0 = N_1 / N.
- The next step is to find an estimate of probability for objects which were seen r times, this estimate is p_r = (r+1) S(N_{r+1}) / (N S(N_r)), where S denotes a smoothed value of N_r.
Where do these estimates come from? What prior probabilities are being assumed? -- Jheald 10:13, 4 March 2007 (UTC)
I think that the formula is wrong. There should not be S(N_r) in the denominator. Try applying it yourself to any Zipfian distribution, e.g., this one http://www.cs.cmu.edu/~roni/11761/. Besides, if you look at the original Good paper, its main result is, essentially, that S_r * r is approximately equal to S_{r+1} * (r+1). So, you can estimate S_r * r/N as S_{r+1}*(r+1)/N — Preceding unsigned comment added by Srchvrs (talk • contribs) 03:41, 11 February 2013 (UTC)
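For concreteness, here is a minimal Python sketch (my own illustration, not from the article) of the two estimates being debated, with the smoothing function S left out entirely: the unseen mass N_1/N and the adjusted count r* = (r+1) N_{r+1}/N_r. The function name and the fallback when N_{r+1} is zero are assumptions for illustration only.

```python
from collections import Counter

def good_turing(counts):
    """Unsmoothed Good-Turing sketch.
    counts: dict mapping species -> observed frequency r."""
    N = sum(counts.values())           # total number of observations
    N_r = Counter(counts.values())     # frequency of frequencies: N_r[r]
    # Total probability mass assigned to all unseen species: N_1 / N.
    p_unseen = N_r[1] / N
    # Adjusted count for a species seen r times: r* = (r+1) * N_{r+1} / N_r,
    # giving the probability estimate r* / N.
    p = {}
    for species, r in counts.items():
        if N_r.get(r + 1):
            r_star = (r + 1) * N_r[r + 1] / N_r[r]
        else:
            r_star = r                 # fallback when no species seen r+1 times
        p[species] = r_star / N
    return p_unseen, p
```

For example, with counts {"red": 3, "blue": 2, "green": 1, "black": 1} we have N = 7, N_1 = 2, N_2 = 1, N_3 = 1, so the unseen mass is 2/7 and "green" (seen once) gets r* = 2 * N_2/N_1 = 1, i.e. 1/7. In practice the raw N_r values are noisy for large r, which is exactly why the article introduces the smoothing S.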
- It is not a Bayesian scheme that starts with prior probabilities. Why it works is somewhat of a mystery and perhaps still an open research problem (although Orlitsky, in the first cited reference [1], claims to have figured it out; I haven't read his work).
- Using the number of species that have been seen only once as an estimator of the number of species that have not been seen yet seems reasonable, but I don't know a formal justification. If there are a lot of species that we have observed only once, then we must still be in the early stages of species discovery, so there are probably a lot of species out there that we have not seen yet.
- The second formula is even less intuitive. It has to do with 'adjusting' the observed counts in a certain manner. Perhaps it would help to add some discussion, but I don't feel qualified to do that. Encyclops 16:34, 4 March 2007 (UTC)
- The formulas and procedure don't precisely agree with my understanding of Turing's original work, which was based on a general theory of distributions of distributions of ... which so far as I know was never published and may still be classified.
- I agree that it is not a Bayesian scheme as such.
- A reasonable intuitive estimate for the likelihood of any single specified species which has not yet been seen (e.g. a purple ball) would seem to be 1/(2N), reasoning that in the absence of any other information, the current set of observations is on the border between having seen one of that species and having seen none of that species, and indeed this assumption has been heavily used by cryptanalysts when they know the full set of species in advance. (The interesting thing about Good–Turing is that the full set of species is not known in advance.) So the formula amounts to saying that the expected number of unseen species is 2N_1, which is interesting in its own right (and still needs justification, which I'm not prepared to provide).
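The arithmetic behind this intuition can be checked directly. Under the stated assumptions (each specified unseen species gets probability 1/(2N), and the Good–Turing estimate assigns total probability N_1/N to unseen species collectively), the implied number of unseen species is (N_1/N) / (1/(2N)) = 2 * N_1. A small numeric check, with made-up values of N and N_1:

```python
def implied_unseen_species(N, N_1):
    """Number of unseen species implied by combining the two estimates."""
    p_each = 1 / (2 * N)      # per-species probability for one specified unseen species
    p_total = N_1 / N         # Good-Turing total probability mass on unseen species
    return p_total / p_each   # expected count of unseen species, = 2 * N_1
```

So with N = 1000 observations and N_1 = 50 singletons, the implied number of unseen species is 100, i.e. twice the number of species seen exactly once.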
- Somewhere at home I have a copy of the original Biometrika paper, which should be cited in the article as the initial publication concerning this topic. — DAGwyn (talk) 21:41, 14 February 2008 (UTC)
- In case it helps you find it, here is a reference that might be what you are looking for: I. J. Good: The population frequencies of species and the estimation of population parameters. Biometrika, 40:237–264, Dec 1953. Encyclops (talk) 23:12, 14 February 2008 (UTC)
The formal similarity between the formula being asked about here, and that of Robbins' empirical Bayes methods, is obvious. Perhaps that can answer the question. I'll look at this more closely and then opine further. Michael Hardy 22:56, 4 May 2007 (UTC)
Example plot request
The following is copied from the Stats project talk page. Melcombe (talk) 14:23, 2 December 2010 (UTC)
- "Instead we plot [...]"
Shouldn't there be a plot? Please add some illustrations or adjust the text. Thank you! --Peni (talk) 14:01, 1 December 2010 (UTC)
- P.S. Source: Good-Turing smoothing without tears, William A. Gale Journal of Quantitative Linguistics, 1995. --Peni (talk) 15:39, 1 December 2010 (UTC)
(end copy)
The Novel by Robert Harris
Has anyone read the novel by Robert Harris? Is it relevant to the topic of this article or should the reference be removed? The comment "The book, though fiction, is criticised by people who were at Bletchley Park as bearing little resemblance to the real wartime Bletchley Park" makes me doubt that the author has any valid technical or historical points to offer on the subject at hand. Encyclops (talk) 16:54, 30 March 2011 (UTC)