Talk:Statistics
This article has been mentioned by a media organization.
Mathematics rating: B-class, Top-priority.
Statistics was a good article, but it was removed from the list as it no longer met the good article criteria at the time. There are suggestions below for improving the article. If you can improve it, please do; it may then be renominated.
This page is for discussion of the article about statistics. Comments and questions about the special page about Wikipedia site statistics (number of pages, edits, etc.) should be directed to Wikipedia talk:Special pages.
Archives
- Archive 1 - Mostly pre-2006 (Range: 2002 - 2006)
- Archive 2 - Mix of 2005 - 2006 (Range: Nov 2005 - Feb 2006)
- Archive 3 - 2006 (Range: Feb 2006 - August 2006)
- Archive 4 - 2006 (Range: July 2006 - Aug 2006)
- Current version: 2006 (Range: Aug 2006 - )
Question
I was wondering if there is a name for the statistical principle that maintains that the more data points you have, the more reliable your dataset will be... Thanks. Jefferson61345 02:30, 8 August 2007 (UTC)
Fallacy?
Statistics can be easily deemed a fallacy. If statistics say that kids whose parents don't talk to them about not smoking are more likely to smoke (you know the common argument), that is a fallacy. Yes, it may be a true statement, but it cannot be argued that the kids whose parents tell them not to smoke would not find smoking cool, or that the kids whose parents didn't tell them not to smoke may feel it is disgusting. Statistics as a field tends to treat all people as equal in all regards when that is clearly not true. Not everybody can throw 49 touchdown passes in an NFL season like Peyton Manning did in 2004 or be the leading goal scorer at the Soccer World Cup. I just figured this might be an idea to consider discussing in the article, even though it may be difficult to find a decent source. 205.166.61.142 00:31, 31 August 2006 (UTC)
- You make some sweeping generalizations. One of the purposes of statistics is to attempt to explain an outcome with the most explanatory variables. If a certain type of person is more likely to have a certain kind of outcome (for example, black men tend to have more cardiovascular problems), it is in the best interest of such research to treat everyone differently, not the same. Statistics such as the t-test and ANOVA often differentiate people more than treat them the same. I think your football analogy may be one of the fallacies you are talking about. Football statistics are descriptive statistics--they only describe those people to which they apply (in your case, professional football players and nobody else). Inferential statistics, such as the t-test, often group people according to like kinds based on particular variables, like incidence rate of cardiovascular health problems. Chris53516 13:43, 31 August 2006 (UTC)
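To make the point about inferential grouping concrete, here is a minimal sketch (in Python, with invented numbers, not taken from any real study) of a two-sample t-test comparing an outcome between two hypothetical groups:

```python
# Minimal sketch of a two-sample t-test on hypothetical data.
# Group labels, means, and sample sizes are invented for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=50)  # hypothetical outcome, group A
group_b = rng.normal(loc=5.6, scale=1.0, size=50)  # hypothetical outcome, group B

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the two groups differ on this outcome; the test
# differentiates the groups rather than treating everyone as the same.
```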
Let me add to that answer in case the poser of the question returns. Statistical methods are not (correctly) used to prove cause and effect or to make claims that something is always true. Statistics is more of an art of educated guessing, where mathematical methods are used to make the best decisions about what is most likely or what tends to be related. In fact, built into the methods of statistics are ways of determining how likely you are to make an error in your "educated guessing". Typically, someone using statistical methods correctly will say, "I am 99% sure that these two factors (such as not smoking and parents telling the child not to smoke) are related to each other." Then qualifiers will be added. Even in that case, a good statistician wouldn't claim that one factor causes the other. It could be that both items are caused by some third, unidentified, factor. But, of course, those types of misinterpretations of statistical results are made all the time. That doesn't mean, however, that cause and effect is not logically the best interpretation of the situation. Suppose, for example, that a large number of people get sick who mostly all ate spinach. We might make a best guess that spinach caused the illness. But really it might be something else, like a common salad dressing used by spinach lovers, or the fact that spinach stuck in their teeth chased away potential romantic relationships, leaving the spinach-eaters in a heart-sick condition which eventually led to real illness. Of course, those alternatives are ridiculous. I guess they COULD be true, but most people would go with the theory that the spinach was tainted. And even if the spinach was the problem, it could be that, for some, there was another unidentified cause. So, we are left with concluding, "Probably this is the cause most of the time." --Newideas07 21:48, 3 November 2006 (UTC)
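To make the "99% sure these two factors are related" idea concrete, here is a minimal sketch of a chi-squared test of independence on an invented 2x2 table (parental advice versus smoking). The counts are hypothetical; the point is only that such a test measures association, not causation.

```python
# Chi-squared test of independence on a hypothetical 2x2 table.
# Rows: parents discussed smoking (yes/no); columns: child smokes (yes/no).
# All counts are invented for illustration.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 170],   # parents discussed it: 30 smoke, 170 do not
                  [60, 140]])  # parents did not:      60 smoke, 140 do not

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
# If p < 0.01, one can say, roughly, "I am 99% sure these factors are
# associated"; association alone says nothing about which causes which.
```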
Need Link to Reliability (statistics) page
This page needs links to the pages on Reliability (statistics) and Factor Analysis. I'm not sure if these should be put under Statistical Techniques or See Also. I'm also wondering if there should be a link to Cronbach's Alpha (which is one type of reliability estimate).
It seems to me that there are probably quite a few statistical techniques that are not linked from this page. Perhaps it would be helpful to create a hierarchical index of statistical techniques. I see that something like this can be done in the Table of contents. Kbarchard 22:24, 16 September 2006 (UTC)
- This page is not a list of statistical topics (which we link to in the "See also" section), and not every statistical technique or estimator needs to be listed here. The ones you mention seem a bit too specialised for a general article on statistics, but could be usefully added to articles like multivariate analysis and social statistics. -- Avenue 01:34, 18 September 2006 (UTC)
Standardized coefficient for DYK
I wrote an article on Standardized coefficient, but I am no expert in statistics. If this could be quickly vetted by an editor more experienced with this field, we could have a statistical WP:DYK. -- Piotr Konieczny aka Prokonsul Piotrus | talk 20:25, 7 October 2006 (UTC)
What is the difference between F(x) and f(x)?
Can somebody please explain to me with an example the difference between F(x) and f(x) for a continuous random variable? As far as I understand, f(x) is the derivative of F(x), please correct me if I am wrong, but that is not sufficient for understanding the whole process. Many thanks. -Chetan. — Preceding unsigned comment added by Chetanpatel13 (talk • contribs)
- Those two should be interchangeable, as far as I know. By the way, use four ~ to sign with your user ID. Chris53516 17:07, 18 October 2006 (UTC)
- Chris, thanks for the response, BTW they are very different. Thanks for the tip and hopefully I am doing it right this time. -- Chetan M Patel 18:24, 18 October 2006 (UTC)
- How are they different? Please use 4 ~ to sign your name. It's easier than what you did. Chris53516 18:31, 18 October 2006 (UTC)
- f(x) is the probability density function (PDF), whereas F(x) is the cumulative distribution function (CDF). Chetan M Patel 18:58, 18 October 2006 (UTC)
- The names of the functions are a convention, widely used in statistics. Perhaps a better question is: what's the difference between a PDF and a CDF? It's probably easiest to understand if you know about integration, with F(u) = \int_{-\infty}^{\infty} f(x)\,dx. As we are working over a continuous domain, the chance of a random variable taking a particular real value, 0.123456789 say, is zero, so it only makes sense to talk of probabilities calculated over a range of values, and it's a convention to use the range giving the CDF. So yes, f(x) = \frac{d}{dx}F(x). What is the meaning of the PDF? Well, if you consider a discrete probability distribution like the binomial distribution, then the PDF is just the probability of a particular number; here the probabilities of a particular number 0, 1, 2, 3 occurring are non-zero. Furthermore, the PDF is useful for visualising the shape of a distribution: for the normal distribution it gives the familiar bell-shaped curve, whereas the CDF would be S-shaped and it's harder to see what's happening. --Salix alba (talk) 20:45, 18 October 2006 (UTC)
- Correction: that should be F(u) = \int_{-\infty}^{u} f(x)\,dx. The upper bound of integration must be u if F(u) is what you're evaluating. Michael Hardy 22:47, 18 October 2006 (UTC)
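For anyone who wants to see that relationship numerically, here is a minimal Python sketch, using the standard normal distribution purely as an illustrative example: integrating the PDF up to u reproduces the CDF at u.

```python
# Numerical check that F(u) equals the integral of f(x) from -infinity to u,
# using the standard normal distribution as an example.
import numpy as np
from scipy import integrate
from scipy.stats import norm

u = 1.0
area, _ = integrate.quad(norm.pdf, -np.inf, u)  # area under the PDF up to u
print(f"integral of the pdf up to {u}: {area:.6f}")
print(f"cdf evaluated at {u}:          {norm.cdf(u):.6f}")
# Both print about 0.841345: the CDF is the accumulated area under the PDF.
```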
In case anyone wants a "Statistics for Dummies" explanation of all that: f(x) is the drawing of a curve that defines a certain probability density function (pattern). For example, a bell-shaped curve has an equation, f(x), and represents a situation in which falling in the middle of some range is most likely, with tapering probabilities as you go to the left or right. Most measurements of objects fall in this category. But probabilities of having x in some range are found by calculating the area under the curve. To find the area under the curve, you have to integrate f(x) to get F(x). Sometimes that is impossible or just really hard, and so approximation techniques are used instead, which is one reason why you usually get probabilities out of tables instead of using equations. There are other theoretical uses for the two functions. I'm not sure if that clarified things for anyone. --Newideas07 21:23, 3 November 2006 (UTC)
In case that didn't clarify things for some people, the 'statistics for dummies for dummies' version is that the pdf is the height of the density at a given point, whereas the cdf is the area under the curve for a range of points. For example, if we want to know the probability of a person being 5'9" tall, that's a question for a pdf (f(x)); if we want to know the probability of being 5'9" or less, that's a cdf (F(x)). Plf515 02:09, 24 November 2006 (UTC)plf515
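A small sketch of that height example, assuming (purely for illustration) that heights are roughly normal with mean 69 inches and standard deviation 3 inches; 5'9" is 69 inches. These numbers are assumptions, not data.

```python
# PDF versus CDF for the height example, assuming heights ~ Normal(69 in, 3 in).
# The mean and standard deviation are illustrative assumptions.
from scipy.stats import norm

heights = norm(loc=69, scale=3)    # hypothetical distribution of heights, inches

density_at_69 = heights.pdf(69)    # f(69): height of the density curve at 5'9"
prob_at_most_69 = heights.cdf(69)  # F(69): probability of being 5'9" or shorter

print(f"density at 69 in:   {density_at_69:.4f}")
print(f"P(height <= 69 in): {prob_at_most_69:.4f}")  # 0.5, since 69 in is the mean
# For a continuous distribution the probability of being *exactly* 5'9" is zero;
# the PDF value is a density, not a probability.
```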
Name of Etymology subsection
Etymology here is the study of the history of the word statistics, not the history of statistics itself. The first paragraph or so of the current Etymology subsection is etymology, but the later paragraphs go beyond etymology to actual history of statistics. That's why I think there are many better, broader titles for this subsection. Or maybe I am interpreting etymology too narrowly? Joshua Davis 15:11, 21 October 2006 (UTC)
- I think Etymology works, even if it does go beyond simple etymology. It's still related to the word's history. -- Chris53516 16:04, 22 October 2006 (UTC)
- I agree that Etymology was not an accurate description here. I've tried to remedy the situation somewhat by moving some of this material to the Statistics Today section. I also removed a reference to Michel Foucault, which does not seem to me to belong here at all. Thefellswooper 22:06, 31 March 2007 (UTC)
Criticism
I would like to propose we change the name of this section to "The Misuse and Limitations of Statistics" or something similar as Joshua suggested. I also would like to make big revisions to it if no one is working on it or attached to it as it is. I'm a statistician (M.S.) and educator. If anyone objects or has a better idea or is already working away hard on this, speak soon or I'll do it. --Newideas07 22:04, 3 November 2006 (UTC)
I think that is a good topic, but for a separate article. There are certainly lots of abuses of statistics, but this page seems fine to me, needing only minor edits. Plf515 02:34, 24 November 2006 (UTC)plf515
Note about archives
I used a method that others may not like. If someone else wants to change the archive, find and copy any new comments, and begin at this page to do so: Start of archiving. Thanks for being patient while I made these archives. -- Chris53516 (Talk) 23:01, 3 November 2006 (UTC)
Merge from applied statistics
There was a suggestion at Talk:Applied statistics to merge into this article - it's only a stub but it may have some potential. I'll leave it for the statisticians here to decide. Richard001 19:53, 6 February 2007 (UTC)
- Merge. In my opinion, "applied statistics" is a redundant phrase. To me it appears that statistics are often applied somehow. So, the article can be merged as a new section or integrated into this article. — Chris53516 (Talk) 20:27, 6 February 2007 (UTC)
- I am not a statistician and cannot really comment on the material, so I won't "formally" vote. But the long-standing stubbiness and infrequent editing suggest a merge to me. I'd add that Mathematical statistics is similarly meager, covering nothing that isn't already covered here. Joshua R. Davis 13:54, 8 February 2007 (UTC)
- Merge. I don't quite agree with Chris53516 when he asserts that "applied statistics" is pleonastic, but this article already covers the distinction between "applied statistics" and "theoretical statistics" adequately, in the introduction. I looked through the applied statistics article carefully, and in my opinion a merger is overkill. Applied statistics should simply be deleted. DavidCBryant 15:48, 8 February 2007 (UTC)
- (Note. If someone deletes the page, be sure to redirect it to this article. — Chris53516 (Talk) 16:00, 8 February 2007 (UTC))
- Having heard no objections, I have gone ahead and changed Applied statistics into a redirect page. Don't give up on Mathematical statistics quite yet, though. I'm trying to get hold of Dcljr, who had quite a few ideas on that score. I'm sure the theoretical article can be turned into something better pretty soon. DavidCBryant 01:50, 14 February 2007 (UTC)
misconceptions
not a statistician here but maybe the article ought to have a section addressing those. statistical mechanics has nothing to do with mathematical statistics. many areas are related to the rigorous formulation of statistical mechanics: probability and analysis, topology, number theory, etc., but not statistics. i also removed the reference to "sports statistics". to call computing, say, slugging percentages or ERAs or free throw percentages "doing statistics" seems rather abhorrent, IMHO. Mct mht 07:09, 10 February 2007 (UTC)
- Thanks for taking those (See also) links out. I concur with your decisions. Do you mean to tell me that Maxwell and Boltzmann aren't just two guys who played for the Yankees? ;^> DavidCBryant 12:36, 10 February 2007 (UTC)
- who's on first base, Dave? :-) Mct mht 07:26, 11 February 2007 (UTC)
- I don't mind losing "statistical mechanics", but in my view removing "sports statistics" is going too far. Sure, the routine collection of free throw percentages etc is not exactly groundbreaking statistical work, but it is a (small) part of statistics. I've seen several articles on aspects of sports statistics in reputable statistical journals. They're admittedly more common in lighter fare (e.g. the ASA's Chance magazine has a regular column titled A Statistician Reads the Sports Pages), but they demonstrate that professional statisticians view sports statistics as within their ambit. -- Avenue 03:21, 11 February 2007 (UTC)
- i am certainly in no position to object if that's the consensus of professional statisticians. Mct mht 07:26, 11 February 2007 (UTC)
- Statistical mechanics is indeed probabilistic mechanics, but I'd be inclined to leave the link here. Sports statistics, as pointed out, is deeper than people may realize. (There was a great article on this in the WSJ around August or Sept. of last year.) There is legitimate inferential statistics going on there, e.g. attempts to correct for the effects of luck on a player's stats. JJL 03:47, 11 February 2007 (UTC)
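A generic sketch of one way statisticians try to correct for luck in a player's numbers (not necessarily the method described in that WSJ article): shrink the observed rate toward the league average, with less shrinkage when the sample is large. The league rate and prior strength below are illustrative assumptions.

```python
# Sketch of shrinking an observed success rate toward the league average.
# league_rate and prior_strength are illustrative assumptions, not real values.
def shrunken_rate(successes, attempts, league_rate=0.300, prior_strength=100):
    """Empirical-Bayes-style estimate: the observed rate pulled toward the league rate.

    prior_strength behaves like a number of pseudo-attempts at the league rate,
    so small samples are shrunk heavily and large samples hardly at all.
    """
    return (successes + prior_strength * league_rate) / (attempts + prior_strength)

# A hot streak of 10-for-20 (.500) is treated mostly as luck under this model:
print(round(shrunken_rate(10, 20), 3))    # about 0.333, pulled strongly toward .300
# A .350 hitter over 600 at-bats keeps most of the credit:
print(round(shrunken_rate(210, 600), 3))  # about 0.343
```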
- I don't much care if sports statistics are listed in this article. At least they're comparable (in quantity) to the other kinds of data regular statisticians deal with. But let's keep the references to physics out of the "see also" list ... the meaning of "statistics" in the context of physics and thermodynamics is substantially different from the meaning this article deals with. I guess I could say I use a result from statistical mechanics (a measurement of the ambient temperature) to "make an informed decision" (whether to wear a flannel shirt, or not). But that really seems like stretching the point, to me. Oh – what's on second, and who's on third. ;^> DavidCBryant 17:25, 11 February 2007 (UTC)
Statistics and Accuracy
Can an expert out there please discuss the topic of statistics and accuracy? For example, do statistics HAVE to be accurate? Or can statistics be a general indication of a trend, reality, etc.?
- In general, the data from which statistics are derived are as accurate as the observers/experimenters/statisticians can make them. I suppose that observational errors are possible (I might think the lights are off when they're really on ... maybe I just went blind, and haven't realized that yet), but in practice observational errors are fairly rare, and easily controlled.
- Even though the observations are accurate, the statistics themselves may be imprecise. In general, the larger the number of observations that can be made, the more precise the statistical estimates that emerge. This tendency of a small sample to diverge somewhat from the true characteristics of the sampled population is quantified, in the first instance, by the statistical variance of the data collected.
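A small simulation sketch of that point, assuming a population with mean 50 and standard deviation 10 (values chosen only for illustration): the spread of the sample mean shrinks as the sample size grows.

```python
# Simulation: larger samples give more precise estimates of the population mean.
# The population parameters (mean 50, sd 10) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(42)
for n in (10, 100, 1000, 10000):
    sample_means = [rng.normal(50, 10, size=n).mean() for _ in range(2000)]
    print(f"n = {n:5d}: spread of the sample mean = {np.std(sample_means):.3f}")
# The spread falls roughly like 10 / sqrt(n): about 3.16 for n = 10 and
# about 0.10 for n = 10000.
```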
- Notice that certain kinds of data (mostly relating to people's opinions, and similar subjective measurements) are inherently less reliable than the measurements that can be made in fields like chemistry and physics. Such data can easily be manipulated to reach misleading conclusions, no matter how carefully statistical procedures are carried out (for example, by asking biased questions, or by limiting the allowed responses on a questionnaire, etc.) DavidCBryant 04:33, 10 August 2007 (UTC)