Statistical significance

From Wikipedia, the free encyclopedia

Statistical significance is an assessment of whether observations reflect a pattern rather than just chance. It is an important part of the process in statistical hypothesis testing. In statistics, a result is considered significant not because it is important or meaningful, but because it has been predicted as unlikely to have occurred by chance alone.

The concept of statistical significance originated with Ronald Fisher, who coined the phrase "test of significance" to describe statistical hypothesis tests.[1] These tests are used to determine which outcomes of a study would lead to a rejection of the null hypothesis, based on a pre-specified low-probability threshold for the p-value; the p-value can help an investigator decide whether a result contains sufficient information to cast doubt on the null hypothesis.

P-values are often compared to a reference probability or significance level (also called the α level), which is set ahead of time, usually at 0.05. Thus, if the p-value is found to be less than 0.05, the result is considered statistically significant and the null hypothesis is rejected.

History

The phrase test of significance was coined by Ronald Fisher.[1] The term significance, used in a statistical sense, dates back to 1885.[2]

Relation to Null Hypothesis Testing

Null hypothesis testing is an adaptation of the reductio ad absurdum argument used in statistics. In essence, a claim is shown to be valid by demonstrating the improbability of the counter-claim that follows from its denial. As such, the only hypothesis that needs to be specified in this test, the one which embodies the counter-claim, is referred to as the null hypothesis H. The rejection of the null hypothesis suggests that the correct hypothesis lies among the logical complements of H.[3] For instance, in the coin-tossing example below, rejection of the hypothesis that the coin is unbiased (i.e. p = 1/2) leads us to accept its logical complement, namely that the coin is biased (i.e. p ≠ 1/2).

Generally, the null hypothesis refers to a general or default position, intended to mean that an experiment will produce a null result; that is, the experiment will not produce anything out of the ordinary. When used in a specific setting, however, what counts as a null result depends on the particular scientific hypothesis under consideration and its implications. In an experimental setting, the null effect can be studied using a "control group". The invalidation of the null hypothesis allows a researcher to conclude that the experiment has discovered something out of the ordinary. An experimental test needs to be specified in statistical language prior to the analysis of the experimental data. The calculated statistical significance of a result is in principle only valid if the statistical null hypothesis was specified before any data were examined. If, instead, the statistical hypothesis was specified after some of the data were examined, and specifically tuned to match the direction in which the early data appeared to point, the calculation would overestimate statistical significance.

Relation to p-values

A statistical hypothesis is defined as the probability distribution that is assumed to govern the observed data. If X is the observed data and H is the statistical hypothesis under consideration, then Fisher's statistical significance is given by the conditional probability P(X | H), which gives the likelihood of the observation if the hypothesis is assumed to be true. If this conditional probability is very small, this means that either (1) we admit that a very rare event has occurred if we assume our hypothesis to be true, or (2) our hypothesis may not explain the observation adequately, and an alternative hypothesis might be needed to explain the observed data. When used in statistics, the word significant does not mean important or meaningful, as it does in everyday speech: with sufficient data, a statistically significant result may be very small in magnitude.

For example, tossing a coin 3 times and obtaining 3 heads would not be considered an extreme result. However, tossing a coin 10 times and finding that all 10 tosses land the same way up would be considered an extreme result. Let us suppose that our hypothesis H is that the coin is fair, i.e., the probability of landing heads is p = 1/2. From this hypothesis, it follows that the probability that we get all heads in 10 tosses is

P(10 heads | H) = (1/2)^10 = 1/1024 ≈ 0.001,

which is rare. The result may be considered statistically significant evidence that our null hypothesis cannot explain the observed data and can therefore be rejected.
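As a rough check of the arithmetic above, the following Python sketch (an illustration added here, not part of the article) computes the probability of 10 heads under a fair coin, and the probability that all 10 tosses land the same way up (heads or tails):

    # Fair-coin model: each toss lands heads with probability 1/2
    p_all_heads = 0.5 ** 10            # (1/2)^10 = 1/1024 ≈ 0.00098
    p_all_same  = 2 * 0.5 ** 10        # all heads or all tails ≈ 0.00195

    print(p_all_heads, p_all_same)

Either probability is well below the conventional 0.05 threshold, which is why the 10-toss outcome counts as extreme while 3 heads in 3 tosses (probability 1/8) does not.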

If X is a continuous random variable and we observe an instance x, then P(X = x | H) = 0, so the definition needs to be modified to accommodate continuous random variables. Usually, instead of the actual observations, X is a test statistic: a scalar function of all the observations. The p-value is then defined as the probability, under the assumption of hypothesis H, of obtaining a result equal to or more extreme than what was actually observed. Depending on how we look at it, "more extreme than what was actually observed" can mean X ≥ x (a right-tail event), X ≤ x (a left-tail event), or the "smaller" of P(X ≥ x) and P(X ≤ x) (a double-tail event). Thus the test of significance as given by the p-value p is

  • p = P(X ≥ x | H) for a right-tail event,
  • p = P(X ≤ x | H) for a left-tail event,
  • p = 2 min{P(X ≥ x | H), P(X ≤ x | H)} for a double-tail event.

The hypothesis H is rejected if any of these probabilities is less than or equal to the level of significance α.
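The three tail probabilities can be illustrated with a short Python sketch. It assumes, purely for illustration, that the test statistic X follows a standard normal distribution under H and uses SciPy's normal distribution functions:

    from scipy.stats import norm

    def p_values(x):
        # Right-, left-, and double-tail p-values for an observed test
        # statistic x, assuming X is standard normal under H
        right = norm.sf(x)            # P(X >= x | H)
        left = norm.cdf(x)            # P(X <= x | H)
        double = 2 * min(left, right)
        return right, left, double

    print(p_values(1.96))  # right ≈ 0.025, left ≈ 0.975, double ≈ 0.05

At x = 1.96 the double-tail p-value is about 0.05, the conventional significance level discussed below.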

The test statistic follows a distribution determined by the function used to define it and by the distribution of the observational data. For the important case in which the data are hypothesized to follow the normal distribution, different null hypothesis tests have been developed depending on the nature of the test statistic and hence on its underlying distribution: the z-test for the normal distribution, the t-test for Student's t-distribution, and the F-test for the F-distribution. When the data do not follow a normal distribution, it may still be possible to approximate the distribution of the test statistic by a normal distribution by invoking the central limit theorem for large samples, as in the case of Pearson's chi-squared test.
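As a minimal sketch of one such test in practice, the following Python example runs a one-sample t-test with SciPy; the sample values and the null value of 0 are hypothetical, chosen only to show the mechanics:

    from scipy import stats

    # Hypothetical data; H0: the population mean is 0
    sample = [0.3, -0.1, 0.8, 0.5, 1.1, 0.2, 0.7, -0.4, 0.9, 0.6]

    t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
    print(t_stat, p_value)

    # Reject H0 at the 5% level if the p-value is at most 0.05
    print("significant" if p_value <= 0.05 else "not significant")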

Sample size

Researchers focusing solely on whether individual test results are significant or not may miss important response patterns which individually fall under the threshold set for tests of significance. Therefore, along with tests of significance, it is preferable to examine effect-size statistics, which describe how large the effect is and the uncertainty around that estimate, so that the practical importance of the effect may be gauged by the reader.

Use in practice

Popular levels of significance are 10% (0.1), 5% (0.05), 1% (0.01), 0.5% (0.005), and 0.1% (0.001). If a test of significance gives a p-value lower than or equal to the significance level,[4] the null hypothesis is rejected at that level. Such results are informally referred to as 'statistically significant (at the p = 0.05 level, etc.)'. For example, if someone argues that "there's only one chance in a thousand this could have happened by coincidence", a 0.001 level of statistical significance is being stated. The lower the significance level chosen, the stronger the evidence required. The choice of significance level is somewhat arbitrary, but for many applications, a level of 5% is chosen by convention.[5][6]

In some situations it is convenient to express the complementary statistical significance (so 0.95 instead of 0.05), which corresponds to a quantile of the test statistic. In general, when interpreting a stated significance, one must be careful to note what, precisely, is being tested statistically.

Different levels of cutoff trade off countervailing effects. Lower levels – such as 0.01 instead of 0.05 – are stricter, and increase confidence in the determination of significance, but run an increased risk of accepting a false null hypothesis. Evaluation of a given p-value of data requires a degree of judgment, and rather than a strict cutoff, one may instead simply consider lower p-values as more significant.

Graphically, statistical significance is often indicated by the use of asterisks (*). The number of asterisks usually indicates the significance level: * for 0.05, ** for 0.01, and *** for 0.001 or 0.005. These symbols may also be used in diagrams, such as bar charts, to indicate a significant effect, for example a significant difference in the mean value between two populations.
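A small helper illustrating this asterisk convention (using the thresholds quoted above; conventions vary between fields and journals) might look like this in Python:

    def significance_stars(p):
        # Map a p-value to the asterisk notation described above
        if p <= 0.001:
            return "***"
        if p <= 0.01:
            return "**"
        if p <= 0.05:
            return "*"
        return ""  # not significant

    print(significance_stars(0.03))    # "*"
    print(significance_stars(0.0004))  # "***"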

In terms of σ (sigma)

In some fields, for example nuclear and particle physics, it is common to express statistical significance in units of the standard deviation σ of a normal distribution. A statistical significance of "nσ" can be converted into a p-value by use of the cumulative distribution function Φ of the standard normal distribution, through the relation

p = 2(1 − Φ(n))

(this formula varies depending on whether a one-tailed or a two-tailed test is appropriate)

or via use of the error function:

p = 1 − erf(n/√2)

Tabulated values of these functions are often found in statistics textbooks: see the standard normal table. The use of σ implicitly assumes a normal distribution of measurement values. For example, if a theory predicts that a parameter has a value of, say, 109 ± 3, and the parameter is measured to be 100, then one might report the measurement as a "3σ deviation" from the theoretical prediction. In terms of the p-value, this statement is equivalent to saying that "assuming the theory is true, the likelihood of obtaining the experimental result by coincidence is 0.27%" (since 1 − erf(3/√2) = 0.0027), again depending on whether a one-tailed or a two-tailed test is appropriate.
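The conversion can be checked numerically. This Python sketch uses the error-function identity quoted above and reproduces the 0.27% figure for a 3σ deviation; the one-tailed value is shown alongside for comparison:

    import math

    def p_two_tailed(n_sigma):
        # p = 1 - erf(n / sqrt(2)), the two-tailed probability of an
        # n-sigma deviation of a normally distributed quantity
        return 1.0 - math.erf(n_sigma / math.sqrt(2))

    def p_one_tailed(n_sigma):
        # p = 1 - Phi(n), using Phi(n) = (1 + erf(n / sqrt(2))) / 2
        return 0.5 * (1.0 - math.erf(n_sigma / math.sqrt(2)))

    print(p_two_tailed(3))   # ≈ 0.0027, the 0.27% quoted above
    print(p_one_tailed(3))   # ≈ 0.00135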

Fixed significance levels such as those mentioned above may be regarded as useful in exploratory data analyses. However, where the outcome of a test is essentially the final outcome of an experiment or other study, modern practice is to quote the p-value explicitly and, importantly, to state whether it is judged significant. This allows the maximum information to be transferred from a summary of the study into meta-analyses.

Pitfalls and criticism

The scientific literature contains extensive discussion of the concept of statistical significance, and in particular of its potential for misuse, abuse, and misunderstanding.

Signal–noise ratio conceptualisation of significance

Statistical significance can be considered to be the confidence one has in a given result. In a comparison study, it depends on the relative difference between the groups compared, the number of measurements, and the noise associated with the measurements. In other words, the confidence one has in a given result being non-random (i.e., that it is not a consequence of chance) depends on the signal-to-noise ratio (SNR) and the sample size.

Expressed mathematically, the confidence that a result is not due to random chance is given by the following formula by Sackett:[7]

confidence = (signal / noise) × √(sample size)

For clarity, the above formula is presented in tabular form below.

Dependence of confidence with noise, signal and sample size (tabular form)

Parameter       Parameter increases       Parameter decreases
Noise           Confidence decreases      Confidence increases
Signal          Confidence increases      Confidence decreases
Sample size     Confidence increases      Confidence decreases

In words, confidence in a result is high if the noise is low, the sample size is large, and/or the effect size (signal) is large. The confidence of a result (and its associated confidence interval) does not depend on effect size alone: if the sample size is large and the noise is low, even a small effect size can be measured with great confidence. Whether a small effect size is considered important depends on the context of the events compared.
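A sketch of how the dependencies in the table follow from the formula; the Python function below simply restates the Sackett expression as reconstructed above, and the numbers are illustrative only:

    import math

    def confidence(signal, noise, sample_size):
        # Confidence grows with signal and sample size, shrinks with noise
        return (signal / noise) * math.sqrt(sample_size)

    print(confidence(signal=2.0, noise=1.0, sample_size=25))    # 10.0
    print(confidence(signal=2.0, noise=2.0, sample_size=25))    # 5.0  (more noise)
    print(confidence(signal=2.0, noise=1.0, sample_size=100))   # 20.0 (larger sample)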

In medicine, small effect sizes (reflected by small increases of risk) are often considered clinically relevant and are frequently used to guide treatment decisions if there is great confidence in them. Whether a given treatment is considered a worthy endeavour is dependent on the risks, benefits and costs.[citation needed]

References

  1. ^ a b Fisher, R. A. (1925). Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd. p. 43.
  2. ^ doi:10.1511/2013.100.6.
  3. ^ Christensen, Ronald (2005). Testing Fisher, Neyman-Pearson, and Bayes. The American Statistician, May 2005, Vol. 59, No. 2. American Statistical Association. DOI: 10.1198/000313005X20871
  4. ^ Fisher, R. A. (1926). "The arrangement of field experiments". Journal of the Ministry of Agriculture. 33: 504.
  5. ^ Stigler 2008.
  6. ^ Fisher 1925.
  7. ^ Sackett, D. L. (2001). "Why randomized controlled trials fail but needn't: 2. Failure to employ physiological statistics, or the only formula a clinician-trialist is ever likely to need (or understand!)". CMAJ. 165 (9): 1226–37. PMC 81587. PMID 11706914.
  • doi:10.1007/s00144-008-0033-3.

Further reading