Correlation coefficient: Difference between revisions

From Wikipedia, the free encyclopedia
2 col ref layout
 
(32 intermediate revisions by 21 users not shown)
Line 1: Line 1:
{{short description|Numerical measure of a statistical relationship between variables}}
{{short description|Numerical measure of a statistical relationship between variables}}
A '''correlation coefficient''' is a [[numerical measure]] of some type of [[correlation and dependence|correlation]], meaning a statistical relationship between two [[variable (mathematics)|variables]].<ref>{{cite web |url=http://www.ncme.org/ncme/NCME/Resource_Center/Glossary/NCME/Resource_Center/Glossary1.aspx?hkey=4bb87415-44dc-4088-9ed9-e8515326a061#anchorC |title=correlation coefficient |author=<!--Not stated--> |website=NCME.org |publisher=[[National Council on Measurement in Education]] |access-date=April 17, 2014 |quote=correlation coefficient: A statistic used to show how the scores from one measure relate to scores on a second measure for the same group of individuals. A high value (approaching +1.00) is a strong direct relationship, values near 0.50 are considered moderate and values below 0.30 are considered to show weak relationship. A low negative value (approaching -1.00) is similarly a strong inverse relationship, and values near 0.00 indicate little, if any, relationship. |archive-url=https://web.archive.org/web/20170722194028/http://www.ncme.org/ncme/NCME/Resource_Center/Glossary/NCME/Resource_Center/Glossary1.aspx?hkey=4bb87415-44dc-4088-9ed9-e8515326a061#anchorC |archive-date=July 22, 2017 |url-status=dead }}</ref> The variables may be two [[column (database)|column]]s of a given [[data set]] of observations, often called a [[sample (statistics)|sample]], or two components of a [[multivariate random variable]] with a known [[distribution (statistics)|distribution]].{{citation needed|date=July 2019}}
A '''correlation coefficient''' is a [[numerical measure]] of some type of '''linear''' [[correlation and dependence|correlation]], meaning a statistical relationship between two [[variable (mathematics)|variables]].{{efn|Correlation coefficient: A [[statistic]] used to show how the scores from one measure relate to scores on a second measure for the same group of individuals. A high value (approaching +1.00) is a strong direct relationship, values near 0.50 are considered moderate and values below 0.30 are considered to show weak relationship. A low negative value (approaching -1.00) is similarly a strong inverse relationship, and values near 0.00 indicate little, if any, relationship.<ref>{{cite web |url=http://www.ncme.org/ncme/NCME/Resource_Center/Glossary/NCME/Resource_Center/Glossary1.aspx?hkey=4bb87415-44dc-4088-9ed9-e8515326a061#anchorC |title=correlation coefficient |author=<!--Not stated--> |website=NCME.org |publisher=[[National Council on Measurement in Education]] |access-date=April 17, 2014 |archive-url=https://web.archive.org/web/20170722194028/http://www.ncme.org/ncme/NCME/Resource_Center/Glossary/NCME/Resource_Center/Glossary1.aspx?hkey=4bb87415-44dc-4088-9ed9-e8515326a061#anchorC |archive-date=July 22, 2017 |url-status=dead}}</ref>}} The variables may be two [[column (database)|column]]s of a given [[data set]] of observations, often called a [[sample (statistics)|sample]], or two components of a [[multivariate random variable]] with a known [[distribution (statistics)|distribution]].{{citation needed|date=July 2019}}


Several types of correlation coefficient exist, each with their own definition and own range of usability and characteristics. They all assume values in the range from −1 to +1, where ±1 indicates the strongest possible agreement and 0 the strongest possible disagreement.<ref>{{cite book |last1=Taylor |first1=John R. |title=An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements |date=1997 |publisher=University Science Books |location=Sausalito, CA |isbn=0-935702-75-X |page=217 |edition=2nd |url=http://faculty.kfupm.edu.sa/phys/aanaqvi/Taylor-An%20Introduction%20to%20Error%20Analysis.pdf |accessdate=14 February 2019}}</ref> As tools of analysis, correlation coefficients present certain problems, including the propensity of some types to be distorted by [[outliers]] and the possibility of incorrectly being used to infer a [[causal relationship]] between the variables.<ref name="Boddy">{{cite book|last1=Boddy|first1=Richard |last2=Smith|first2=Gordon |title=Statistical methods in practice: for scientists and technologists |date=2009|publisher=Wiley|location=Chichester, U.K.|isbn=978-0-470-74664-6|pages=95–96}}</ref>
Several types of correlation coefficient exist, each with their own definition and own range of usability and characteristics. They all assume values in the range from −1 to +1, where ±1 indicates the strongest possible correlation and 0 indicates no correlation.<ref>{{cite book |last1=Taylor |first1=John R. |title=An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements |date=1997 |publisher=University Science Books |location=Sausalito, CA |isbn=0-935702-75-X |page=217 |edition=2nd |url=http://faculty.kfupm.edu.sa/phys/aanaqvi/Taylor-An%20Introduction%20to%20Error%20Analysis.pdf |access-date=14 February 2019 |archive-url=https://web.archive.org/web/20190215050550/http://faculty.kfupm.edu.sa/phys/aanaqvi/Taylor-An%20Introduction%20to%20Error%20Analysis.pdf |archive-date=15 February 2019 |url-status=dead }}</ref> As tools of analysis, correlation coefficients present certain problems, including the propensity of some types to be distorted by [[outliers]] and the possibility of incorrectly being used to infer a [[causal relationship]] between the variables (for more, see [[Correlation does not imply causation]]).<ref name="Boddy">{{cite book |last1=Boddy |first1=Richard |last2=Smith |first2=Gordon |title=Statistical Methods in Practice: For scientists and technologists |date=2009 |publisher=Wiley |location=Chichester, U.K. |isbn=978-0-470-74664-6 |pages=95–96}}</ref>


==Types==
==Types==
There are several different measures for the degree of correlation in data, depending on the kind of data: principally whether the data is a measurement, [[Ordinal data|ordinal]], or [[Categorical data|categorical]].

=== Pearson ===
=== Pearson ===
The [[Pearson product-moment correlation coefficient]], also known as ''r'', ''R'', or Pearson's ''r'', is a measure of the strength and direction of the linear relationship between two variables that is defined as the [[covariance]] of the variables divided by the product of their standard deviations. This is the best-known and most commonly used type of correlation coefficient. When the term "correlation coefficient" is used without further qualification, it usually refers to the Pearson product-moment correlation coefficient.
The [[Pearson product-moment correlation coefficient]], also known as {{mvar|r}}, {{mvar|R}}, or ''Pearson's''&nbsp;{{mvar|r}}, is a measure of the strength and direction of the ''linear'' relationship between two variables that is defined as the [[covariance]] of the variables divided by the product of their standard deviations.<ref>{{Cite web|last=Weisstein|first=Eric W.|title=Statistical Correlation|url=https://mathworld.wolfram.com/StatisticalCorrelation.html|access-date=2020-08-22|website=mathworld.wolfram.com|language=en}}</ref> This is the best-known and most commonly used type of correlation coefficient. When the term "correlation coefficient" is used without further qualification, it usually refers to the Pearson product-moment correlation coefficient.


=== Intra-class ===
=== Intra-class ===
[[Intraclass correlation]] (ICC) is a descriptive statistic that can be used when quantitative measurements are made on units that are organized into groups; it describes how strongly units in the same group resemble each other.
[[Intraclass correlation]] (ICC) is a descriptive statistic that can be used, when quantitative measurements are made on units that are organized into groups; it describes how strongly units in the same group resemble each other.


=== Rank ===
=== Rank ===
[[Rank correlation]] is a measure of the relationship between the rankings of two variables or two rankings of the same variable:
[[Rank correlation]] is a measure of the relationship between the rankings of two variables, or two rankings of the same variable:
*[[Spearman's rank correlation coefficient]] is a measure of how well the relationship between two variables can be described by a monotonic function.
*[[Spearman's rank correlation coefficient]] is a measure of how well the relationship between two variables can be described by a monotonic function.
*The [[Kendall tau rank correlation coefficient]] is a measure of the portion of ranks that match between two data sets.
*The [[Kendall tau rank correlation coefficient]] is a measure of the portion of ranks that match between two data sets.
*[[Goodman and Kruskal's gamma]] is a measure of the strength of association of the cross tabulated data when both variables are measured at the ordinal level.
*[[Goodman and Kruskal's gamma]] is a measure of the strength of association of the cross tabulated data when both variables are measured at the ordinal level.


=== Tetrachoric and Polychoric ===
=== Tetrachoric and polychoric ===


The [[polychoric correlation]] coefficient measures association between two ordered-categorical variables. It's technically defined as the estimate of the Pearson correlation coefficient one would obtain if (1) the two variables were measured on a continuous scale, instead of as ordered-category variables, and (2) the two continuous variables followed a [[multivariate normal distribution|bivariate normal distribution]]. When both variables are dichotomous instead of ordered-categorical, the [[polychoric correlation]] coefficient is called the tetrachoric correlation coefficient.
The [[polychoric correlation]] coefficient measures association between two ordered-categorical variables. It's technically defined as the estimate of the Pearson correlation coefficient one would obtain if:
# The two variables were measured on a continuous scale, instead of as ordered-category variables.
# The two continuous variables followed a [[multivariate normal distribution|bivariate normal distribution]].
When both variables are [[Dichotomous variable|dichotomous]] instead of ordered-categorical, the [[polychoric correlation]] coefficient is called the tetrachoric correlation coefficient.

===Interpreting correlation coefficient values===

The correlation between two variables have different associations that are measured in values such as {{mvar|r}} or {{mvar|R}}. Correlation values range from −1 to +1, where ±1 indicates the strongest possible correlation and 0 indicates no correlation between variables.<ref>{{cite book |last1=Taylor |first1=John R. |title=An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements |date=1997 |publisher=University Science Books |location=Sausalito, CA |isbn=0-935702-75-X |page=217 |edition=2nd |url=http://faculty.kfupm.edu.sa/phys/aanaqvi/Taylor-An%20Introduction%20to%20Error%20Analysis.pdf |access-date=14 February 2019 |archive-url=https://web.archive.org/web/20190215050550/http://faculty.kfupm.edu.sa/phys/aanaqvi/Taylor-An%20Introduction%20to%20Error%20Analysis.pdf |archive-date=15 February 2019 |url-status=dead }}</ref>

{| class="wikitable"
|-
! {{mvar|r}} or {{mvar|R}} !! {{mvar|r}} or {{mvar|R}} !! Strength or weakness of association between variables<ref>{{cite web |title=The Correlation Coefficient (r) |url=https://sphweb.bumc.bu.edu/otlt/MPH-Modules/PH717-QuantCore/PH717-Module9-Correlation-Regression/PH717-Module9-Correlation-Regression4.html |website=Boston University}}</ref>
|-
| +1.0 to +0.8 || -1.0 to -0.8 || Perfect or very strong association
|-
| +0.8 to +0.6 || -0.8 to -0.6 || Strong association
|-
| +0.6 to +0.4 || -0.6 to -0.4 || Moderate association
|-
| +0.4 to +0.2 || -0.4 to -0.2 || Weak association
|-
| +0.2 to 0.0 || -0.2 to 0.0 || Very weak or no association
|}


==See also==
==See also==
*[[Correlation disattenuation]]
*[[Coefficient of determination]]
*[[Correlation and dependence]]
*[[Correlation ratio]]
*[[Distance correlation]]
*[[Distance correlation]]
*[[Goodness of fit]], any of several measures that measure how well a statistical model fits observations by summarizing the discrepancy between observed values and the values expected under the model
*[[Goodness of fit]], any of several measures that measure how well a statistical model fits observations by summarizing the discrepancy between observed values and the values expected under the model
*[[Multiple correlation]]
*[[Coefficient of determination]]
*[[Partial correlation]]
*[[Partial correlation]]

==Notes==
{{notelist|1}}


==References==
==References==
Line 33: Line 66:
{{Portal bar|Mathematics}}
{{Portal bar|Mathematics}}


[[Category:Correlation indicators]]
[[Category:Mathematical terminology]]
[[Category:Mathematical terminology]]
[[Category:Covariance and correlation]]

Latest revision as of 20:58, 28 November 2024

A correlation coefficient is a numerical measure of some type of linear correlation, meaning a statistical relationship between two variables.[a] The variables may be two columns of a given data set of observations, often called a sample, or two components of a multivariate random variable with a known distribution.[citation needed]

Several types of correlation coefficient exist, each with its own definition, range of applicability, and characteristics. They all take values in the range from −1 to +1, where ±1 indicates the strongest possible correlation and 0 indicates no correlation.[2] As tools of analysis, correlation coefficients present certain problems: some types are easily distorted by outliers, and a correlation may be wrongly taken as evidence of a causal relationship between the variables (for more, see Correlation does not imply causation).[3]
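
The sensitivity to outliers can be illustrated with a small sketch. The data below are made up for the illustration, and `pearson_r` is a hypothetical helper written for this example rather than a function from any particular library. With these numbers, a single extreme point is enough to turn a clearly positive coefficient into a negative one.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length samples (hypothetical helper)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sums of cross-products and squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up sample with a clear upward trend.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]
r_clean = pearson_r(x, y)

# The same sample plus one extreme point far away from the trend.
r_with_outlier = pearson_r(x + [20], y + [-10])

print(f"without outlier: {r_clean:+.2f}")
print(f"with outlier:    {r_with_outlier:+.2f}")
```

Rank-based coefficients are generally less affected by such points, because they use only the ordering of the values.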

Types

Several different measures of the degree of correlation exist; which one applies depends on the kind of data, principally whether the data are measurements, ordinal, or categorical.

Pearson

The Pearson product-moment correlation coefficient, also known as r, R, or Pearson's r, is a measure of the strength and direction of the linear relationship between two variables that is defined as the covariance of the variables divided by the product of their standard deviations.[4] This is the best-known and most commonly used type of correlation coefficient. When the term "correlation coefficient" is used without further qualification, it usually refers to the Pearson product-moment correlation coefficient.
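
The definition above (covariance divided by the product of the standard deviations) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `pearson_r` and the sample data are assumptions made for the example, and the sample (n − 1) denominators are one possible choice that cancels in the ratio.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson r: covariance over the product of standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least two points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance and standard deviations, all with an n - 1 denominator.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (sd_x * sd_y)

# Hypothetical data: hours studied and test score for five students.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
print(round(pearson_r(hours, score), 3))
```

A perfectly linear increasing relationship gives a value of +1 and a perfectly linear decreasing one gives −1, up to floating-point rounding.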

Intra-class

Intraclass correlation (ICC) is a descriptive statistic that can be used when quantitative measurements are made on units that are organized into groups; it describes how strongly units in the same group resemble each other.
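
Several ICC variants exist. As a rough sketch of one of them, the code below computes the one-way random-effects form from the between-group and within-group mean squares of a one-way analysis of variance. The choice of this variant, the assumption that all groups have the same size, and the name `icc_oneway` are assumptions made for the illustration.

```python
def icc_oneway(groups):
    """One-way random-effects ICC for equally sized groups (illustrative sketch)."""
    n = len(groups)          # number of groups
    k = len(groups[0])       # measurements per group
    if n < 2 or k < 2 or any(len(g) != k for g in groups):
        raise ValueError("need at least two groups, all of the same size >= 2")
    grand_mean = sum(sum(g) for g in groups) / (n * k)
    group_means = [sum(g) / k for g in groups]
    # Mean square between groups: spread of the group means around the grand mean.
    ms_between = k * sum((m - grand_mean) ** 2 for m in group_means) / (n - 1)
    # Mean square within groups: spread of the units around their own group mean.
    ms_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    ) / (n * (k - 1))
    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all measurements are identical")
    return (ms_between - ms_within) / denominator

# Made-up measurements: three groups of two units each.
similar_within = [[1.0, 1.2], [5.0, 5.1], [9.0, 8.8]]
print(round(icc_oneway(similar_within), 3))
```

Values near 1 mean that units in the same group are much more alike than units from different groups.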

Rank

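Rank correlation is a measure of the relationship between the rankings of two variables, or two rankings of the same variable:

  * Spearman's rank correlation coefficient is a measure of how well the relationship between two variables can be described by a monotonic function.
  * The Kendall tau rank correlation coefficient is a measure of the portion of ranks that match between two data sets.
  * Goodman and Kruskal's gamma is a measure of the strength of association of the cross-tabulated data when both variables are measured at the ordinal level.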
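
As an illustration of rank correlation, the sketch below computes Spearman's coefficient as the Pearson correlation of the ranks (tied values share their average rank) and Kendall's tau in its simplest "tau-a" form, which counts concordant and discordant pairs without any correction for ties. The function names and the data are assumptions made for the example.

```python
import math
from itertools import combinations

def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(xs, ys):
    """Spearman's coefficient: Pearson correlation applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

# A monotonic but non-linear relationship: y grows as the cube of x.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print("Pearson: ", round(pearson_r(x, y), 3))
print("Spearman:", round(spearman_rho(x, y), 3))
print("Kendall: ", round(kendall_tau_a(x, y), 3))
```

Because the relationship is perfectly monotonic, both rank coefficients reach their maximum, while the Pearson coefficient stays below 1 because the relationship is not linear.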

Tetrachoric and polychoric

The polychoric correlation coefficient measures association between two ordered-categorical variables. It is technically defined as the estimate of the Pearson correlation coefficient that one would obtain if:

  1. The two variables were measured on a continuous scale, instead of as ordered-category variables.
  2. The two continuous variables followed a bivariate normal distribution.

When both variables are dichotomous instead of ordered-categorical, the polychoric correlation coefficient is called the tetrachoric correlation coefficient.
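
Estimating a polychoric or tetrachoric coefficient properly requires fitting a bivariate normal model to the observed table, which is beyond a short example. For the dichotomous case, a classical closed-form shortcut, often called the cosine-pi approximation, estimates the coefficient from the odds ratio of the 2×2 table. The sketch below shows only that approximation, not the definition given above; the function name and the cell labels `a`, `b`, `c`, `d` are assumptions made for the illustration.

```python
import math

def tetrachoric_cos_pi(a, b, c, d):
    """Cosine-pi approximation for the 2x2 table [[a, b], [c, d]].

    a and d count the cases where both variables agree (low/low, high/high);
    b and c count the cases where they disagree.
    """
    if min(a, b, c, d) < 0:
        raise ValueError("cell counts must be non-negative")
    if a * d == 0 and b * c == 0:
        raise ValueError("approximation is undefined for this table")
    if b * c == 0:
        return 1.0  # no disagreeing cases in at least one off-diagonal cell
    odds_ratio = (a * d) / (b * c)
    return math.cos(math.pi / (1 + math.sqrt(odds_ratio)))

# Made-up counts: the two dichotomous variables mostly agree.
print(round(tetrachoric_cos_pi(40, 10, 10, 40), 3))
```

An odds ratio of 1 (no association) gives a value of 0, and the value moves toward +1 or −1 as the agreeing or the disagreeing cells dominate.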

Interpreting correlation coefficient values

The strength of the association between two variables is expressed by the value of a correlation coefficient such as r or R. Correlation values range from −1 to +1, where ±1 indicates the strongest possible correlation and 0 indicates no correlation between variables.[5]

r or R (positive)    r or R (negative)    Strength or weakness of association between variables[6]
+1.0 to +0.8         −1.0 to −0.8         Perfect or very strong association
+0.8 to +0.6         −0.8 to −0.6         Strong association
+0.6 to +0.4         −0.6 to −0.4         Moderate association
+0.4 to +0.2         −0.4 to −0.2         Weak association
+0.2 to 0.0          −0.2 to 0.0          Very weak or no association
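
The table can be turned into a small lookup function. The sketch below is only an illustration: the name `describe_strength` is hypothetical, and because adjacent ranges in the table share their boundary values, assigning a boundary such as 0.6 to the stronger category is an assumption of this example.

```python
def describe_strength(r):
    """Map a correlation coefficient to the verbal label used in the table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    magnitude = abs(r)  # the sign gives the direction, not the strength
    if magnitude >= 0.8:
        return "Perfect or very strong association"
    if magnitude >= 0.6:
        return "Strong association"
    if magnitude >= 0.4:
        return "Moderate association"
    if magnitude >= 0.2:
        return "Weak association"
    return "Very weak or no association"

for value in (0.95, -0.7, 0.45, -0.25, 0.1):
    print(f"{value:+.2f}: {describe_strength(value)}")
```

Such labels are conventions rather than fixed rules, and other sources draw the boundaries differently.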

See also

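  * Correlation disattenuation
  * Coefficient of determination
  * Correlation and dependence
  * Correlation ratio
  * Distance correlation
  * Goodness of fit, any of several measures of how well a statistical model fits observations by summarizing the discrepancy between observed values and the values expected under the model
  * Multiple correlation
  * Partial correlation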

Notes

  1. ^ Correlation coefficient: A statistic used to show how the scores from one measure relate to scores on a second measure for the same group of individuals. A high value (approaching +1.00) is a strong direct relationship, values near 0.50 are considered moderate and values below 0.30 are considered to show weak relationship. A low negative value (approaching -1.00) is similarly a strong inverse relationship, and values near 0.00 indicate little, if any, relationship.[1]

References

  1. ^ "correlation coefficient". NCME.org. National Council on Measurement in Education. Archived from the original on July 22, 2017. Retrieved April 17, 2014.
  2. ^ Taylor, John R. (1997). An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements (PDF) (2nd ed.). Sausalito, CA: University Science Books. p. 217. ISBN 0-935702-75-X. Archived from the original (PDF) on 15 February 2019. Retrieved 14 February 2019.
  3. ^ Boddy, Richard; Smith, Gordon (2009). Statistical Methods in Practice: For scientists and technologists. Chichester, U.K.: Wiley. pp. 95–96. ISBN 978-0-470-74664-6.
  4. ^ Weisstein, Eric W. "Statistical Correlation". mathworld.wolfram.com. Retrieved 2020-08-22.
  5. ^ Taylor, John R. (1997). An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements (PDF) (2nd ed.). Sausalito, CA: University Science Books. p. 217. ISBN 0-935702-75-X. Archived from the original (PDF) on 15 February 2019. Retrieved 14 February 2019.
  6. ^ "The Correlation Coefficient (r)". Boston University.