Correlation coefficient: Difference between revisions

From Wikipedia, the free encyclopedia
General revision throughout the page. Improved citations. Changes in phrasing for better clarity and flow. Removed duplicate spaces. Broken down lengthy sentences. Expanded and alphabeticalized See Also. Reformatted an inline list into bullet points.
 
(26 intermediate revisions by 16 users not shown)
Line 1: Line 1:
{{short description|Numerical measure of a statistical relationship between variables}}
{{short description|Numerical measure of a statistical relationship between variables}}
A '''correlation coefficient''' is a [[numerical measure]] of some type of [[correlation and dependence|correlation]], meaning a statistical relationship between two [[variable (mathematics)|variables]].{{efn|Correlation coefficient: A statistic used to show how the scores from one measure relate to scores on a second measure for the same group of individuals. A high value (approaching +1.00) is a strong direct relationship, values near 0.50 are considered moderate and values below 0.30 are considered to show weak relationship. A low negative value (approaching -1.00) is similarly a strong inverse relationship, and values near 0.00 indicate little, if any, relationship.<ref>{{cite web |url=http://www.ncme.org/ncme/NCME/Resource_Center/Glossary/NCME/Resource_Center/Glossary1.aspx?hkey=4bb87415-44dc-4088-9ed9-e8515326a061#anchorC |title=correlation coefficient |author=<!--Not stated--> |website=NCME.org |publisher=[[National Council on Measurement in Education]] |access-date=April 17, 2014 |archive-url=https://web.archive.org/web/20170722194028/http://www.ncme.org/ncme/NCME/Resource_Center/Glossary/NCME/Resource_Center/Glossary1.aspx?hkey=4bb87415-44dc-4088-9ed9-e8515326a061#anchorC |archive-date=July 22, 2017 |url-status=dead}}</ref>}} The variables may be two [[column (database)|column]]s of a given [[data set]] of observations, often called a [[sample (statistics)|sample]], or two components of a [[multivariate random variable]] with a known [[distribution (statistics)|distribution]].{{citation needed|date=July 2019}}
A '''correlation coefficient''' is a [[numerical measure]] of some type of '''linear''' [[correlation and dependence|correlation]], meaning a statistical relationship between two [[variable (mathematics)|variables]].{{efn|Correlation coefficient: A [[statistic]] used to show how the scores from one measure relate to scores on a second measure for the same group of individuals. A high value (approaching +1.00) is a strong direct relationship, values near 0.50 are considered moderate and values below 0.30 are considered to show weak relationship. A low negative value (approaching -1.00) is similarly a strong inverse relationship, and values near 0.00 indicate little, if any, relationship.<ref>{{cite web |url=http://www.ncme.org/ncme/NCME/Resource_Center/Glossary/NCME/Resource_Center/Glossary1.aspx?hkey=4bb87415-44dc-4088-9ed9-e8515326a061#anchorC |title=correlation coefficient |author=<!--Not stated--> |website=NCME.org |publisher=[[National Council on Measurement in Education]] |access-date=April 17, 2014 |archive-url=https://web.archive.org/web/20170722194028/http://www.ncme.org/ncme/NCME/Resource_Center/Glossary/NCME/Resource_Center/Glossary1.aspx?hkey=4bb87415-44dc-4088-9ed9-e8515326a061#anchorC |archive-date=July 22, 2017 |url-status=dead}}</ref>}} The variables may be two [[column (database)|column]]s of a given [[data set]] of observations, often called a [[sample (statistics)|sample]], or two components of a [[multivariate random variable]] with a known [[distribution (statistics)|distribution]].{{citation needed|date=July 2019}}


Several types of correlation coefficient exist, each with their own definition and own range of usability and characteristics. They all assume values in the range from −1 to +1, where ±1 indicates the strongest possible agreement and 0 the strongest possible disagreement.<ref>{{cite book |last1=Taylor |first1=John R. |title=An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements |date=1997 |publisher=University Science Books |location=Sausalito, CA |isbn=0-935702-75-X |page=217 |edition=2nd |url=http://faculty.kfupm.edu.sa/phys/aanaqvi/Taylor-An%20Introduction%20to%20Error%20Analysis.pdf |accessdate=14 February 2019 |archive-url=https://web.archive.org/web/20190215050550/http://faculty.kfupm.edu.sa/phys/aanaqvi/Taylor-An%20Introduction%20to%20Error%20Analysis.pdf |archive-date=15 February 2019 |url-status=dead }}</ref> As tools of analysis, correlation coefficients present certain problems, including the propensity of some types to be distorted by [[outliers]] and the possibility of incorrectly being used to infer a [[causal relationship]] between the variables (for more, see [[Correlation does not imply causation]]).<ref name="Boddy">{{cite book |last1=Boddy |first1=Richard |last2=Smith |first2=Gordon |title=Statistical Methods in Practice: For scientists and technologists |date=2009 |publisher=Wiley |location=Chichester, U.K. |isbn=978-0-470-74664-6 |pages=95–96}}</ref>
Several types of correlation coefficient exist, each with their own definition and own range of usability and characteristics. They all assume values in the range from −1 to +1, where ±1 indicates the strongest possible correlation and 0 indicates no correlation.<ref>{{cite book |last1=Taylor |first1=John R. |title=An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements |date=1997 |publisher=University Science Books |location=Sausalito, CA |isbn=0-935702-75-X |page=217 |edition=2nd |url=http://faculty.kfupm.edu.sa/phys/aanaqvi/Taylor-An%20Introduction%20to%20Error%20Analysis.pdf |access-date=14 February 2019 |archive-url=https://web.archive.org/web/20190215050550/http://faculty.kfupm.edu.sa/phys/aanaqvi/Taylor-An%20Introduction%20to%20Error%20Analysis.pdf |archive-date=15 February 2019 |url-status=dead }}</ref> As tools of analysis, correlation coefficients present certain problems, including the propensity of some types to be distorted by [[outliers]] and the possibility of incorrectly being used to infer a [[causal relationship]] between the variables (for more, see [[Correlation does not imply causation]]).<ref name="Boddy">{{cite book |last1=Boddy |first1=Richard |last2=Smith |first2=Gordon |title=Statistical Methods in Practice: For scientists and technologists |date=2009 |publisher=Wiley |location=Chichester, U.K. |isbn=978-0-470-74664-6 |pages=95–96}}</ref>


==Types==
==Types==
There are several different measures for the degree of correlation in data, depending on the kind of data: principally whether the data is a measurement, ordinal, or categorical.
There are several different measures for the degree of correlation in data, depending on the kind of data: principally whether the data is a measurement, [[Ordinal data|ordinal]], or [[Categorical data|categorical]].


=== Pearson ===
=== Pearson ===
The [[Pearson product-moment correlation coefficient]], also known as {{mvar|r}}, {{mvar|R}}, or ''Pearson's''&nbsp;{{mvar|r}}, is a measure of the strength and direction of the linear relationship between two variables that is defined as the [[covariance]] of the variables divided by the product of their standard deviations.<ref>{{Cite web|date=2020-04-26|title=List of Probability and Statistics Symbols|url=https://mathvault.ca/hub/higher-math/math-symbols/probability-statistics-symbols/|access-date=2020-08-22|website=Math Vault|language=en-US}}</ref><ref>{{Cite web|last=Weisstein|first=Eric W.|title=Statistical Correlation|url=https://mathworld.wolfram.com/StatisticalCorrelation.html|access-date=2020-08-22|website=mathworld.wolfram.com|language=en}}</ref> This is the best-known and most commonly used type of correlation coefficient. When the term "correlation coefficient" is used without further qualification, it usually refers to the Pearson product-moment correlation coefficient.
The [[Pearson product-moment correlation coefficient]], also known as {{mvar|r}}, {{mvar|R}}, or ''Pearson's''&nbsp;{{mvar|r}}, is a measure of the strength and direction of the ''linear'' relationship between two variables that is defined as the [[covariance]] of the variables divided by the product of their standard deviations.<ref>{{Cite web|last=Weisstein|first=Eric W.|title=Statistical Correlation|url=https://mathworld.wolfram.com/StatisticalCorrelation.html|access-date=2020-08-22|website=mathworld.wolfram.com|language=en}}</ref> This is the best-known and most commonly used type of correlation coefficient. When the term "correlation coefficient" is used without further qualification, it usually refers to the Pearson product-moment correlation coefficient.


=== Intra-class ===
=== Intra-class ===
Line 19: Line 19:
*[[Goodman and Kruskal's gamma]] is a measure of the strength of association of the cross tabulated data when both variables are measured at the ordinal level.
*[[Goodman and Kruskal's gamma]] is a measure of the strength of association of the cross tabulated data when both variables are measured at the ordinal level.


=== Tetrachoric and Polychoric ===
=== Tetrachoric and polychoric ===


The [[polychoric correlation]] coefficient measures association between two ordered-categorical variables. It's technically defined as the estimate of the Pearson correlation coefficient one would obtain if:
The [[polychoric correlation]] coefficient measures association between two ordered-categorical variables. It's technically defined as the estimate of the Pearson correlation coefficient one would obtain if:
Line 26: Line 26:
# The two continuous variables followed a [[multivariate normal distribution|bivariate normal distribution]].
# The two continuous variables followed a [[multivariate normal distribution|bivariate normal distribution]].


When both variables are dichotomous instead of ordered-categorical, the [[polychoric correlation]] coefficient is called the tetrachoric correlation coefficient.
When both variables are [[Dichotomous variable|dichotomous]] instead of ordered-categorical, the [[polychoric correlation]] coefficient is called the tetrachoric correlation coefficient.

===Interpreting correlation coefficient values===

The correlation between two variables have different associations that are measured in values such as {{mvar|r}} or {{mvar|R}}. Correlation values range from −1 to +1, where ±1 indicates the strongest possible correlation and 0 indicates no correlation between variables.<ref>{{cite book |last1=Taylor |first1=John R. |title=An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements |date=1997 |publisher=University Science Books |location=Sausalito, CA |isbn=0-935702-75-X |page=217 |edition=2nd |url=http://faculty.kfupm.edu.sa/phys/aanaqvi/Taylor-An%20Introduction%20to%20Error%20Analysis.pdf |access-date=14 February 2019 |archive-url=https://web.archive.org/web/20190215050550/http://faculty.kfupm.edu.sa/phys/aanaqvi/Taylor-An%20Introduction%20to%20Error%20Analysis.pdf |archive-date=15 February 2019 |url-status=dead }}</ref>

{| class="wikitable"
|-
! {{mvar|r}} or {{mvar|R}} !! {{mvar|r}} or {{mvar|R}} !! Strength or weakness of association between variables<ref>{{cite web |title=The Correlation Coefficient (r) |url=https://sphweb.bumc.bu.edu/otlt/MPH-Modules/PH717-QuantCore/PH717-Module9-Correlation-Regression/PH717-Module9-Correlation-Regression4.html |website=Boston University}}</ref>
|-
| +1.0 to +0.8 || -1.0 to -0.8 || Perfect or very strong association
|-
| +0.8 to +0.6 || -0.8 to -0.6 || Strong association
|-
| +0.6 to +0.4 || -0.6 to -0.4 || Moderate association
|-
| +0.4 to +0.2 || -0.4 to -0.2 || Weak association
|-
| +0.2 to 0.0 || -0.2 to 0.0 || Very weak or no association
|}


==See also==
==See also==
*[[Correlation disattenuation]]
*[[Coefficient of determination]]
*[[Coefficient of determination]]
*[[Correlation and dependence]]
*[[Correlation and dependence]]
Line 34: Line 54:
*[[Distance correlation]]
*[[Distance correlation]]
*[[Goodness of fit]], any of several measures that measure how well a statistical model fits observations by summarizing the discrepancy between observed values and the values expected under the model
*[[Goodness of fit]], any of several measures that measure how well a statistical model fits observations by summarizing the discrepancy between observed values and the values expected under the model
*Multiple correlation
*[[Multiple correlation]]
*[[Partial correlation]]
*[[Partial correlation]]


==Footnotes==
==Notes==
{{notelist|1}}
{{notelist|1}}


Line 46: Line 66:
{{Portal bar|Mathematics}}
{{Portal bar|Mathematics}}


[[Category:Correlation indicators]]
[[Category:Mathematical terminology]]
[[Category:Mathematical terminology]]
[[Category:Covariance and correlation]]

Latest revision as of 20:58, 28 November 2024

A correlation coefficient is a numerical measure of some type of linear correlation, meaning a statistical relationship between two variables.[a] The variables may be two columns of a given data set of observations, often called a sample, or two components of a multivariate random variable with a known distribution.[citation needed]

Several types of correlation coefficient exist, each with its own definition, range of applicability, and characteristics. They all assume values in the range from −1 to +1, where ±1 indicates the strongest possible correlation and 0 indicates no correlation.[2] As tools of analysis, correlation coefficients present certain problems, including the propensity of some types to be distorted by outliers and the possibility of being used incorrectly to infer a causal relationship between the variables (for more, see Correlation does not imply causation).[3]

Types

There are several different measures for the degree of correlation in data, depending on the kind of data: principally whether the data is a measurement, ordinal, or categorical.

Pearson

The Pearson product-moment correlation coefficient, also known as r, R, or Pearson's r, is a measure of the strength and direction of the linear relationship between two variables that is defined as the covariance of the variables divided by the product of their standard deviations.[4] This is the best-known and most commonly used type of correlation coefficient. When the term "correlation coefficient" is used without further qualification, it usually refers to the Pearson product-moment correlation coefficient.
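As a minimal sketch of the definition above, the following Python function computes the sample Pearson coefficient as the covariance divided by the product of the standard deviations. The function name `pearson_r` and its error handling are illustrative assumptions, not part of any cited source.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson r: covariance over the product of standard deviations.

    Hypothetical helper for illustration only.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations. The common 1/n
    # (or 1/(n-1)) factor of covariance and variance cancels in the ratio.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either variable is constant")
    return sxy / math.sqrt(sxx * syy)
```

For example, `pearson_r([1, 2, 3, 4], [2, 4, 6, 8])` describes a perfectly linear increasing relationship, while reversing the second list gives a perfectly linear decreasing one.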

Intra-class

Intraclass correlation (ICC) is a descriptive statistic that can be used when quantitative measurements are made on units that are organized into groups; it describes how strongly units in the same group resemble each other.
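Several definitions of the ICC are in use, and the article does not single one out. The sketch below assumes one common choice, the one-way random-effects form for equally sized groups, computed from the between-group and within-group mean squares. The function name `icc_one_way` is hypothetical.

```python
def icc_one_way(groups):
    """One-way random-effects ICC for equally sized groups (one of several ICC forms).

    Illustrative sketch: (MSB - MSW) / (MSB + (k - 1) * MSW),
    where k is the number of measurements per group.
    """
    n_groups = len(groups)
    if n_groups < 2:
        raise ValueError("need at least 2 groups")
    k = len(groups[0])
    if k < 2 or any(len(g) != k for g in groups):
        raise ValueError("all groups must have the same size, at least 2")
    grand_mean = sum(sum(g) for g in groups) / (n_groups * k)
    group_means = [sum(g) / k for g in groups]
    # Mean square between groups: spread of the group means around the grand mean.
    ms_between = k * sum((m - grand_mean) ** 2 for m in group_means) / (n_groups - 1)
    # Mean square within groups: spread of units around their own group mean.
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    ) / (n_groups * (k - 1))
    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all measurements are identical")
    return (ms_between - ms_within) / denominator
```

When units within each group are identical but the groups differ, as in `icc_one_way([[1, 1], [2, 2], [3, 3]])`, the within-group mean square is zero and the result is 1, the strongest possible resemblance.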

Rank

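Rank correlation is a measure of the relationship between the rankings of two variables, or two rankings of the same variable. Measures of this kind include Goodman and Kruskal's gamma, which measures the strength of association of cross-tabulated data when both variables are measured at the ordinal level.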
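To illustrate how a rank-based measure works, the sketch below computes Goodman and Kruskal's gamma by counting concordant and discordant pairs of observations. It assumes the ordinal levels are coded as comparable numbers; the function name `goodman_kruskal_gamma` is hypothetical.

```python
def goodman_kruskal_gamma(xs, ys):
    """Gamma = (concordant - discordant) / (concordant + discordant).

    Illustrative sketch; assumes ordinal levels are coded as numbers.
    """
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    concordant = 0
    discordant = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            product = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if product > 0:
                concordant += 1  # both variables order the pair the same way
            elif product < 0:
                discordant += 1  # the variables order the pair in opposite ways
            # Pairs tied on either variable are ignored.
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (concordant - discordant) / (concordant + discordant)
```

Because only the ordering of pairs matters, any monotone recoding of the levels leaves the result unchanged, which is what distinguishes this kind of measure from one based on covariance.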

Tetrachoric and polychoric

The polychoric correlation coefficient measures association between two ordered-categorical variables. It is technically defined as the estimate of the Pearson correlation coefficient one would obtain if:

  1. The two variables were measured on a continuous scale, instead of as ordered-category variables.
  2. The two continuous variables followed a bivariate normal distribution.

When both variables are dichotomous instead of ordered-categorical, the polychoric correlation coefficient is called the tetrachoric correlation coefficient.

Interpreting correlation coefficient values

The strength of the association between two variables is expressed by a correlation value such as r or R. Correlation values range from −1 to +1, where ±1 indicates the strongest possible correlation and 0 indicates no correlation between variables.[5]

| Positive r or R | Negative r or R | Strength or weakness of association between variables[6] |
|---|---|---|
| +1.0 to +0.8 | −1.0 to −0.8 | Perfect or very strong association |
| +0.8 to +0.6 | −0.8 to −0.6 | Strong association |
| +0.6 to +0.4 | −0.6 to −0.4 | Moderate association |
| +0.4 to +0.2 | −0.4 to −0.2 | Weak association |
| +0.2 to 0.0 | −0.2 to 0.0 | Very weak or no association |
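
The bands above can be sketched as a small lookup on the absolute value of the coefficient. The table's ranges share their endpoints, so this sketch assumes that a boundary value such as 0.8 falls into the stronger band; that tie-breaking rule and the function name `describe_strength` are assumptions, not part of the cited source.

```python
def describe_strength(r):
    """Map a correlation value in [-1, 1] to the descriptive band in the table.

    Illustrative sketch; boundary values are assigned to the stronger band.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    magnitude = abs(r)  # the sign gives direction, not strength
    if magnitude >= 0.8:
        return "Perfect or very strong association"
    if magnitude >= 0.6:
        return "Strong association"
    if magnitude >= 0.4:
        return "Moderate association"
    if magnitude >= 0.2:
        return "Weak association"
    return "Very weak or no association"
```

Such labels are conventions rather than statistical tests, so the same value may be described differently in another field or source.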

See also

Notes

  1. ^ Correlation coefficient: A statistic used to show how the scores from one measure relate to scores on a second measure for the same group of individuals. A high value (approaching +1.00) is a strong direct relationship, values near 0.50 are considered moderate and values below 0.30 are considered to show weak relationship. A low negative value (approaching -1.00) is similarly a strong inverse relationship, and values near 0.00 indicate little, if any, relationship.[1]

References

  1. ^ "correlation coefficient". NCME.org. National Council on Measurement in Education. Archived from the original on July 22, 2017. Retrieved April 17, 2014.
  2. ^ Taylor, John R. (1997). An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements (PDF) (2nd ed.). Sausalito, CA: University Science Books. p. 217. ISBN 0-935702-75-X. Archived from the original (PDF) on 15 February 2019. Retrieved 14 February 2019.
  3. ^ Boddy, Richard; Smith, Gordon (2009). Statistical Methods in Practice: For scientists and technologists. Chichester, U.K.: Wiley. pp. 95–96. ISBN 978-0-470-74664-6.
  4. ^ Weisstein, Eric W. "Statistical Correlation". mathworld.wolfram.com. Retrieved 2020-08-22.
  5. ^ Taylor, John R. (1997). An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements (PDF) (2nd ed.). Sausalito, CA: University Science Books. p. 217. ISBN 0-935702-75-X. Archived from the original (PDF) on 15 February 2019. Retrieved 14 February 2019.
  6. ^ "The Correlation Coefficient (r)". Boston University.