Cramér's V: Difference between revisions
m Replace magic links with templates per local RfC and MediaWiki RfC |
|||
(37 intermediate revisions by 26 users not shown) | |||
Line 1: | Line 1: | ||
{{Short description|Statistical measure of association}} |
|||
In [[statistics]], '''Cramér's V''' (sometimes referred to as '''Cramér's phi''' and denoted as '''φ<sub>''c''</sub>''') is a measure of [[Association (statistics)|association]] between two [[Nominal data#Nominal scale|nominal variables]], giving a value between 0 and +1 (inclusive). It is based on [[Pearson's chi-squared test#Calculating the test-statistic|Pearson's chi-squared statistic]] and was published by [[Harald Cramér]] in 1946.<ref>Cramér, Harald. 1946. ''Mathematical Methods of Statistics''. Princeton: Princeton University Press, page 282 (Chapter 21. The two-dimensional case). {{ISBN|0-691-08004-6}} ([http://press.princeton.edu/TOCs/c391.html table of content])</ref> |
In [[statistics]], '''Cramér's V''' (sometimes referred to as '''Cramér's phi''' and denoted as '''φ<sub>''c''</sub>''') is a measure of [[Association (statistics)|association]] between two [[Nominal data#Nominal scale|nominal variables]], giving a value between 0 and +1 (inclusive). It is based on [[Pearson's chi-squared test#Calculating the test-statistic|Pearson's chi-squared statistic]] and was published by [[Harald Cramér]] in 1946.<ref>Cramér, Harald. 1946. ''Mathematical Methods of Statistics''. Princeton: Princeton University Press, page 282 (Chapter 21. The two-dimensional case). {{ISBN|0-691-08004-6}} ([http://press.princeton.edu/TOCs/c391.html table of content] {{Webarchive|url=https://web.archive.org/web/20160816102234/http://press.princeton.edu/TOCs/c391.html |date=2016-08-16 }})</ref> |
||
==Usage and interpretation== |
==Usage and interpretation== |
||
φ<sub>''c''</sub> is the intercorrelation of two discrete variables<ref name="Ref_a">Sheskin, David J. (1997). Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, Fl: CRC Press.</ref> and may be used with variables having two or more levels. φ<sub>''c''</sub> is a symmetrical measure |
φ<sub>''c''</sub> is the intercorrelation of two discrete variables<ref name="Ref_a">Sheskin, David J. (1997). Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, Fl: CRC Press.</ref> and may be used with variables having two or more levels. φ<sub>''c''</sub> is a symmetrical measure: it does not matter which variable we place in the columns and which in the rows. Also, the order of rows/columns does not matter, so φ<sub>''c''</sub> may be used with nominal data types or higher (notably, ordered or numerical). |
||
⚫ | Cramér's V varies from 0 (corresponding to [[Independence (probability theory)|no association]] between the variables) to 1 (complete association) and can reach 1 only when each variable is completely determined by the other. It may be viewed as the association between two variables as a percentage of their maximum possible variation. |
||
Cramér's V may also be applied to [[goodness of fit]] chi-squared models when there is a 1×k table (e.g.: ''r''=1). In this case ''k'' is taken as the number of optional outcomes and it functions as a measure of tendency towards a single outcome. {{citation needed|date=January 2016}} |
|||
⚫ | |||
φ<sub>''c''</sub><sup>2</sup> is the mean square [[canonical correlation]] between the variables.{{citation needed|date=January 2011}} |
φ<sub>''c''</sub><sup>2</sup> is the mean square [[canonical correlation]] between the variables.{{citation needed|date=January 2011}} |
||
In the case of a |
In the case of a 2 × 2 [[contingency table]] Cramér's V is equal to the absolute value of [[Phi coefficient]]. |
||
Note that as chi-squared values tend to increase with the number of cells, the greater the difference between ''r'' (rows) and ''c'' (columns), the more likely φ<sub>c</sub> will tend to 1 without strong evidence of a meaningful correlation.{{Citation needed|date=June 2011}} |
|||
V may be viewed as the association between two variables as a percentage of their maximum possible variation. V<sup>2</sup> is the mean square [[canonical correlation]] between the variables. {{Citation needed|date=March 2015}} |
|||
==Calculation== |
==Calculation== |
||
Line 22: | Line 17: | ||
The chi-squared statistic then is: |
The chi-squared statistic then is: |
||
:<math>\chi^2=\sum_{i,j}\frac{(n_{ij}-\frac{n_{i.}n_{.j}}{n})^2}{\frac{n_{i.}n_{.j}}{n}}</math> |
:<math>\chi^2=\sum_{i,j}\frac{(n_{ij}-\frac{n_{i.}n_{.j}}{n})^2}{\frac{n_{i.}n_{.j}}{n}}\;,</math> |
||
where <math>n_{i.}=\sum_jn_{ij}</math> is the number of times the value <math>A_i</math> is observed and <math>n_{.j}=\sum_in_{ij}</math> is the number of times the value <math>B_j</math> is observed. |
|||
Cramér's V is computed by taking the square root of the chi-squared statistic divided by the sample size and the minimum dimension minus 1: |
Cramér's V is computed by taking the square root of the chi-squared statistic divided by the sample size and the minimum dimension minus 1: |
||
:<math>V = \sqrt{\frac{\varphi^2}{\min(k - 1,r-1)}} = \sqrt{ \frac{\chi^2/n}{\min(k - 1,r-1)}}</math> |
:<math>V = \sqrt{\frac{\varphi^2}{\min(k - 1,r-1)}} = \sqrt{ \frac{\chi^2/n}{\min(k - 1,r-1)}}\;,</math> |
||
where: |
where: |
||
* <math>\varphi |
* <math>\varphi</math> is the phi coefficient. |
||
* <math>\chi^2</math> is derived from |
* <math>\chi^2</math> is derived from Pearson's chi-squared test |
||
* <math>n</math> is the grand total of observations and |
* <math>n</math> is the grand total of observations and |
||
* <math>k</math> being the number of columns. |
* <math>k</math> being the number of columns. |
||
Line 38: | Line 35: | ||
The formula for the variance of ''V''=φ<sub>''c''</sub> is known.<ref>Liebetrau, Albert M. (1983). ''Measures of association''. Newbury Park, CA: Sage Publications. Quantitative Applications in the Social Sciences Series No. 32. (pages 15–16)</ref> |
The formula for the variance of ''V''=φ<sub>''c''</sub> is known.<ref>Liebetrau, Albert M. (1983). ''Measures of association''. Newbury Park, CA: Sage Publications. Quantitative Applications in the Social Sciences Series No. 32. (pages 15–16)</ref> |
||
In R, the function <code> |
In R, the function <code>cramerV()</code> from the package <code>rcompanion</code><ref>{{Cite web | url=https://CRAN.R-project.org/package=rcompanion | title=Rcompanion: Functions to Support Extension Education Program Evaluation| date=2019-01-03}}</ref> calculates ''V'' using the chisq.test function from the stats package. In contrast to the function <code>cramersV()</code> from the <code>lsr</code><ref>{{Cite web | url=https://CRAN.R-project.org/package=lsr | title=Lsr: Companion to "Learning Statistics with R"| date=2015-03-02}}</ref> package, <code>cramerV()</code> also offers an option to correct for bias. It applies the correction described in the following section. |
||
==Bias correction== |
==Bias correction== |
||
Line 50: | Line 47: | ||
:<math> \tilde r = r - \frac{(r-1)^2}{n-1} </math> |
:<math> \tilde r = r - \frac{(r-1)^2}{n-1} </math> |
||
Then <math>\tilde V</math> estimates the same population quantity as Cramér's V but with typically much smaller [[mean squared error]]. The rationale for the correction is that under independence, |
Then <math>\tilde V</math> estimates the same population quantity as Cramér's V but with typically much smaller [[mean squared error]]. The rationale for the correction is that under independence, |
||
<math>E\varphi^2=\frac{(k-1)(r-1)}{n-1}</math>.<ref>{{cite journal |last=Bartlett |first=Maurice S. |year=1937 |title=Properties of Sufficiency and Statistical Tests |journal=Proceedings of the Royal Society of London |series=Series A |volume=160 |issue=901 |pages=268–282 |jstor=96803 }}</ref> |
<math>E[\varphi^2]=\frac{(k-1)(r-1)}{n-1}</math>.<ref>{{cite journal |last=Bartlett |first=Maurice S. |year=1937 |title=Properties of Sufficiency and Statistical Tests |journal=Proceedings of the Royal Society of London |series=Series A |volume=160 |issue=901 |pages=268–282 |jstor=96803 |doi=10.1098/rspa.1937.0109 |bibcode=1937RSPSA.160..268B |doi-access= }}</ref> |
||
==See also== |
==See also== |
||
'''Other measures of correlation for nominal data:''' |
'''Other measures of correlation for nominal data:''' |
||
* The [[Percent Maximum Difference (PMD)|Percent Maximum Difference]]<ref>{{Cite journal |last=Tyler |first=Scott R. |last2=Bunyavanich |first2=Supinda |last3=Schadt |first3=Eric E. |date=2021-11-19 |title=PMD Uncovers Widespread Cell-State Erasure by scRNAseq Batch Correction Methods |url=https://www.biorxiv.org/content/10.1101/2021.11.15.468733v1 |journal=BioRxiv |language=en |pages=2021.11.15.468733 |doi=10.1101/2021.11.15.468733}}</ref> |
|||
* The [[phi coefficient]] |
* The [[phi coefficient]] |
||
* [[Tschuprow's T]] |
* [[Tschuprow's T]] |
||
Line 67: | Line 65: | ||
* [[Contingency table]] |
* [[Contingency table]] |
||
* [[Effect size]] |
* [[Effect size]] |
||
* {{slink|Cluster analysis|External evaluation}} |
|||
* [[Cluster_analysis#External_evaluation]] |
|||
==References== |
==References== |
||
Line 73: | Line 71: | ||
==External links== |
==External links== |
||
* [ |
* [https://www.jstor.org/stable/2577276 A Measure of Association for Nonparametric Statistics] (Alan C. Acock and Gordon R. Stavig Page 1381 of 1381–1386) |
||
* [http://www.people.vcu.edu/~pdattalo/702SuppRead/MeasAssoc/NominalAssoc. |
* [http://www.people.vcu.edu/~pdattalo/702SuppRead/MeasAssoc/NominalAssoc.html Nominal Association: Phi and Cramer's Vl] from the homepage of Pat Dattalo. |
||
{{Statistics}} |
{{Statistics}} |
||
Line 80: | Line 78: | ||
[[Category:Statistical ratios]] |
[[Category:Statistical ratios]] |
||
[[Category:Summary statistics for contingency tables]] |
[[Category:Summary statistics for contingency tables]] |
||
[[Category:Covariance and correlation]] |
Latest revision as of 20:47, 28 March 2024
In statistics, Cramér's V (sometimes referred to as Cramér's phi and denoted as φc) is a measure of association between two nominal variables, giving a value between 0 and +1 (inclusive). It is based on Pearson's chi-squared statistic and was published by Harald Cramér in 1946.[1]
Usage and interpretation
φc is the intercorrelation of two discrete variables[2] and may be used with variables having two or more levels. φc is a symmetrical measure: it does not matter which variable we place in the columns and which in the rows. Also, the order of rows/columns does not matter, so φc may be used with nominal data types or higher (notably, ordered or numerical).
Cramér's V varies from 0 (corresponding to no association between the variables) to 1 (complete association) and can reach 1 only when each variable is completely determined by the other. It may be viewed as the association between two variables as a percentage of their maximum possible variation.
φc² is the mean square canonical correlation between the variables.[citation needed]
In the case of a 2 × 2 contingency table, Cramér's V is equal to the absolute value of the phi coefficient.
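The 2 × 2 case can be checked numerically. The following is a minimal sketch, assuming the usual closed form of the phi coefficient for a table with cells a, b, c, d and the definition of V from the Calculation section; the function names and the counts are hypothetical and chosen only for illustration.

```python
import math

def cramers_v(table):
    """Cramér's V for an r x k table of counts (no empty rows or columns)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt((chi2 / n) / min(len(col_totals) - 1, len(row_totals) - 1))

def phi_2x2(table):
    """Signed phi coefficient of a 2 x 2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

table = [[10, 20], [30, 40]]  # hypothetical counts
# phi carries a sign, V does not; their magnitudes should agree.
print(phi_2x2(table), cramers_v(table))
```

For these counts phi is negative, while V reports only the magnitude of the association.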
Calculation
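Let a sample of size n of the simultaneously distributed variables <math>A</math> and <math>B</math> for <math>i=1,\ldots,r;\ j=1,\ldots,k</math> be given by the frequencies
:<math>n_{ij}=</math> number of times the values <math>(A_i,B_j)</math> were observed.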
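The chi-squared statistic then is:
:<math>\chi^2=\sum_{i,j}\frac{(n_{ij}-\frac{n_{i.}n_{.j}}{n})^2}{\frac{n_{i.}n_{.j}}{n}}\;,</math>
where <math>n_{i.}=\sum_jn_{ij}</math> is the number of times the value <math>A_i</math> is observed and <math>n_{.j}=\sum_in_{ij}</math> is the number of times the value <math>B_j</math> is observed.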
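As an illustration, the chi-squared statistic can be computed directly from a table of counts by comparing each observed frequency with the frequency expected from the row and column totals. The sketch below assumes a table given as a list of rows with no empty row or column; the function name and the counts are hypothetical.

```python
def chi_squared(table):
    """Pearson's chi-squared statistic for an r x k table of counts.

    Assumes every row total and column total is positive, so that
    each expected frequency is non-zero.
    """
    row_totals = [sum(row) for row in table]        # row marginal counts
    col_totals = [sum(col) for col in zip(*table)]  # column marginal counts
    n = sum(row_totals)                             # grand total
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

table = [[10, 20], [30, 40]]  # hypothetical counts
print(chi_squared(table))
```

With these counts the expected frequencies are 12, 18, 28 and 42, so the statistic equals 4/12 + 4/18 + 4/28 + 4/42 = 50/63.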
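Cramér's V is computed by taking the square root of the chi-squared statistic divided by the sample size and the minimum dimension minus 1:
:<math>V = \sqrt{\frac{\varphi^2}{\min(k - 1,r-1)}} = \sqrt{ \frac{\chi^2/n}{\min(k - 1,r-1)}}\;,</math>
where:
- <math>\varphi</math> is the phi coefficient,
- <math>\chi^2</math> is derived from Pearson's chi-squared test,
- <math>n</math> is the grand total of observations,
- <math>k</math> is the number of columns, and
- <math>r</math> is the number of rows.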
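The calculation can be sketched in a few lines. This is an illustrative implementation under the definitions above, not the code of any particular package; the function name and the example tables are hypothetical, and the table is assumed to have at least two rows and two columns with no empty row or column.

```python
import math

def cramers_v(table):
    """Cramér's V: sqrt((chi2 / n) / min(k - 1, r - 1)) for an r x k table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    r, k = len(row_totals), len(col_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt((chi2 / n) / min(k - 1, r - 1))

# Rows proportional to each other: the variables are independent in the sample.
independent = [[10, 20], [20, 40]]
# Each row value determines the column value and vice versa.
determined = [[25, 0], [0, 25]]
print(cramers_v(independent), cramers_v(determined))
```

The two example tables sit at the ends of the range: the first has no association and the second has complete association.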
The p-value for the significance of V is the same one that is calculated using Pearson's chi-squared test.[citation needed]
The formula for the variance of V=φc is known.[3]
In R, the function cramerV() from the package rcompanion[4] calculates V using the chisq.test function from the stats package. In contrast to the function cramersV() from the lsr[5] package, cramerV() also offers an option to correct for bias. It applies the correction described in the following section.
Bias correction
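Cramér's V can be a heavily biased estimator of its population counterpart and will tend to overestimate the strength of association. A bias correction, using the above notation, is given by[6]
:<math>\tilde V = \sqrt{\frac{\tilde\varphi^2}{\min(\tilde k - 1,\tilde r - 1)}}</math>
where
:<math>\tilde\varphi^2 = \max\left(0,\varphi^2 - \frac{(k-1)(r-1)}{n-1}\right)</math>
and
:<math>\tilde k = k - \frac{(k-1)^2}{n-1}</math>
:<math> \tilde r = r - \frac{(r-1)^2}{n-1} </math>
Then <math>\tilde V</math> estimates the same population quantity as Cramér's V but with typically much smaller mean squared error. The rationale for the correction is that under independence, <math>E[\varphi^2]=\frac{(k-1)(r-1)}{n-1}</math>.[7]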
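A sketch of the corrected estimator follows. It assumes the correction takes the form usually attributed to Bergsma (2013): the expected value of φ² under independence, (k−1)(r−1)/(n−1), is subtracted from φ² and the result truncated at zero, and k and r are shrunk by (k−1)²/(n−1) and (r−1)²/(n−1) respectively. The function name and the counts are hypothetical, and the code is not taken from rcompanion.

```python
import math

def cramers_v_corrected(table):
    """Bias-corrected Cramér's V (assumed Bergsma-style correction).

    Assumes an r x k table of counts with r, k >= 2, n > 1 and no
    empty row or column.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    r, k = len(row_totals), len(col_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    phi2 = chi2 / n
    # Remove the expected value of phi^2 under independence, floor at 0.
    phi2_tilde = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    # Shrink the table dimensions by the matching correction terms.
    k_tilde = k - (k - 1) ** 2 / (n - 1)
    r_tilde = r - (r - 1) ** 2 / (n - 1)
    return math.sqrt(phi2_tilde / min(k_tilde - 1, r_tilde - 1))

weak = [[10, 20], [30, 40]]  # hypothetical counts, weak association
print(cramers_v_corrected(weak))
```

For this table φ² (about 0.0079) is smaller than its expected value under independence (1/99, about 0.0101), so the corrected estimate is truncated to zero, whereas the uncorrected V is about 0.089.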
See also
Other measures of correlation for nominal data:
- The Percent Maximum Difference[8]
- The phi coefficient
- Tschuprow's T
- The uncertainty coefficient
- The Lambda coefficient
- The Rand index
- Davies–Bouldin index
- Dunn index
- Jaccard index
- Fowlkes–Mallows index
Other related articles:
- Contingency table
- Effect size
- Cluster analysis § External evaluation
References
- ^ Cramér, Harald. 1946. Mathematical Methods of Statistics. Princeton: Princeton University Press, page 282 (Chapter 21. The two-dimensional case). ISBN 0-691-08004-6 (table of content Archived 2016-08-16 at the Wayback Machine)
- ^ Sheskin, David J. (1997). Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, Fl: CRC Press.
- ^ Liebetrau, Albert M. (1983). Measures of association. Newbury Park, CA: Sage Publications. Quantitative Applications in the Social Sciences Series No. 32. (pages 15–16)
- ^ "Rcompanion: Functions to Support Extension Education Program Evaluation". 2019-01-03.
- ^ "Lsr: Companion to "Learning Statistics with R"". 2015-03-02.
- ^ Bergsma, Wicher (2013). "A bias correction for Cramér's V and Tschuprow's T". Journal of the Korean Statistical Society. 42 (3): 323–328. doi:10.1016/j.jkss.2012.10.002.
- ^ Bartlett, Maurice S. (1937). "Properties of Sufficiency and Statistical Tests". Proceedings of the Royal Society of London. Series A. 160 (901): 268–282. Bibcode:1937RSPSA.160..268B. doi:10.1098/rspa.1937.0109. JSTOR 96803.
- ^ Tyler, Scott R.; Bunyavanich, Supinda; Schadt, Eric E. (2021-11-19). "PMD Uncovers Widespread Cell-State Erasure by scRNAseq Batch Correction Methods". BioRxiv: 2021.11.15.468733. doi:10.1101/2021.11.15.468733.
External links
- A Measure of Association for Nonparametric Statistics (Alan C. Acock and Gordon R. Stavig Page 1381 of 1381–1386)
- Nominal Association: Phi and Cramer's Vl from the homepage of Pat Dattalo.