Cramér's V: Difference between revisions

Revision as of 17:50, 28 May 2018

In statistics, Cramér's V (sometimes referred to as Cramér's phi and denoted as φ_c) is a measure of association between two nominal variables, giving a value between 0 and +1 (inclusive). It is based on Pearson's chi-squared statistic and was published by Harald Cramér in 1946.^[1]

Usage and interpretation

φ_c is the intercorrelation of two discrete variables^[2] and may be used with variables having two or more levels. φ_c is a symmetrical measure: it does not matter which variable is placed in the columns and which in the rows. The order of the rows and columns does not matter either, so φ_c may be used with nominal data types or higher (notably ordered or numerical).

Cramér's V may also be applied to goodness of fit chi-squared models when there is a 1 × k table (in this case r = 1). Here k is taken as the number of possible outcomes, and V functions as a measure of tendency towards a single outcome. ^{[citation needed]}

Cramér's V varies from 0 (corresponding to no association between the variables) to 1 (complete association) and can reach 1 only when the two variables are equal to each other.

φ_c² is the mean square canonical correlation between the variables.^{[citation needed]}

In the case of a 2 × 2 contingency table Cramér's V is equal to the Phi coefficient.
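
As a brief check of this identity, write the 2 × 2 table with cell counts a, b in the first row and c, d in the second row, and let n = a + b + c + d. These four symbols are introduced here only for illustration and are not part of the article's notation. The standard closed forms for such a table are

\chi ^{2}={\frac {n(ad-bc)^{2}}{(a+b)(c+d)(a+c)(b+d)}},\qquad \varphi ={\frac {ad-bc}{\sqrt {(a+b)(c+d)(a+c)(b+d)}}}

Because the smaller of (rows − 1) and (columns − 1) is 1 for a 2 × 2 table, the normalising denominator drops out and

V={\sqrt {\chi ^{2}/n}}=|\varphi |

so V agrees with the phi coefficient in magnitude. The sign of φ, which depends on how the rows and columns are ordered, is not retained by V.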

Note that because chi-squared values tend to increase with the number of cells, the greater the difference between r (rows) and k (columns), the more likely φ_c will tend to 1 without strong evidence of a meaningful correlation.^{[citation needed]}

V may be viewed as the association between two variables as a percentage of their maximum possible variation. ^{[citation needed]}

Calculation

Let a sample of size n of the simultaneously distributed variables $A$ and $B$ for $i=1,\ldots ,r;j=1,\ldots ,k$ be given by the frequencies

$n_{ij}=$ number of times the values $(A_{i},B_{j})$ were observed.

With the row totals $n_{i.}=\sum _{j}n_{ij}$ and the column totals $n_{.j}=\sum _{i}n_{ij}$, the chi-squared statistic then is:

\chi ^{2}=\sum _{i,j}{\frac {(n_{ij}-{\frac {n_{i.}n_{.j}}{n}})^{2}}{\frac {n_{i.}n_{.j}}{n}}}

Cramér's V is computed by taking the square root of the chi-squared statistic divided by the sample size and the minimum dimension minus 1:

V={\sqrt {\frac {\varphi ^{2}}{\min(k-1,r-1)}}}={\sqrt {\frac {\chi ^{2}/n}{\min(k-1,r-1)}}}

where:

$\varphi$ is the phi coefficient,
$\chi ^{2}$ is derived from Pearson's chi-squared test,
$n$ is the grand total of observations,
$k$ is the number of columns, and
$r$ is the number of rows.
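
The calculation above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name cramers_v and the list-of-rows table layout are assumptions made for this example, and only the standard library is used.

```python
from math import sqrt


def cramers_v(table):
    """Cramér's V for an r x k contingency table given as a list of rows.

    `table[i][j]` holds the observed count n_ij. The name and the input
    layout are illustrative choices, not part of any library.
    """
    r = len(table)           # number of rows
    k = len(table[0])        # number of columns
    if r < 2 or k < 2:
        raise ValueError("need at least two rows and two columns")
    if any(len(row) != k for row in table):
        raise ValueError("all rows must have the same length")

    row_totals = [sum(row) for row in table]         # n_{i.}
    col_totals = [sum(col) for col in zip(*table)]   # n_{.j}
    n = sum(row_totals)                              # grand total
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a non-zero total")

    # Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected
    chi2 = 0.0
    for i in range(r):
        for j in range(k):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # V = sqrt( (chi2 / n) / min(k - 1, r - 1) )
    return sqrt((chi2 / n) / min(k - 1, r - 1))
```

For instance, cramers_v([[10, 0], [0, 10]]) describes a table in which each variable fixes the other and evaluates to 1 (up to floating-point rounding), while cramers_v([[5, 5], [5, 5]]) describes an exactly independent table and evaluates to 0.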

The p-value for the significance of V is the same one that is calculated using the Pearson's chi-squared test.^{[citation needed]}

The formula for the variance of V=φ_c is known.^[3]

In R, the function cramersV() from the lsr package calculates V using the chisq.test function from the stats package.^[4]

Bias correction

Cramér's V can be a heavily biased estimator of its population counterpart and will tend to overestimate the strength of association. A bias correction, using the above notation, is given by^[5]

{\tilde {V}}={\sqrt {\frac {{\tilde {\varphi }}^{2}}{\min({\tilde {k}}-1,{\tilde {r}}-1)}}}

where

{\tilde {\varphi }}^{2}=\max \left(0,\varphi ^{2}-{\frac {(k-1)(r-1)}{n-1}}\right)

and

{\tilde {k}}=k-{\frac {(k-1)^{2}}{n-1}}

{\tilde {r}}=r-{\frac {(r-1)^{2}}{n-1}}

Then ${\tilde {V}}$ estimates the same population quantity as Cramér's V but with typically much smaller mean squared error. The rationale for the correction is that under independence, $E[\varphi ^{2}]={\frac {(k-1)(r-1)}{n-1}}$.^[6]
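
The correction can be sketched in Python as follows. This is an illustrative transcription of the three formulas in this section, not code from the cited paper: the function name cramers_v_corrected and the list-of-rows input are assumptions, and the guard clauses are additions for robustness.

```python
from math import sqrt


def cramers_v_corrected(table):
    """Bias-corrected Cramér's V, following the formulas in this section.

    `table` is a list of rows of observed counts (an assumed layout).
    """
    r = len(table)           # number of rows
    k = len(table[0])        # number of columns
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if r < 2 or k < 2 or n < 2:
        raise ValueError("need at least a 2 x 2 table and n >= 2")
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a non-zero total")

    # phi^2 = chi^2 / n, with chi^2 Pearson's statistic
    chi2 = 0.0
    for i in range(r):
        for j in range(k):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    phi2 = chi2 / n

    # Subtract the expected value of phi^2 under independence, floored at 0
    phi2_tilde = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    # Corrected table dimensions
    k_tilde = k - (k - 1) ** 2 / (n - 1)
    r_tilde = r - (r - 1) ** 2 / (n - 1)

    denom = min(k_tilde - 1, r_tilde - 1)
    if denom <= 0:
        raise ValueError("sample too small relative to the table size")
    return sqrt(phi2_tilde / denom)
```

For the table [[10, 20], [30, 40]], φ² is about 0.0079, which is below the independence expectation 1/99 ≈ 0.0101, so the corrected value is 0 even though the uncorrected V is about 0.089.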

References

[1] Cramér, Harald. 1946. Mathematical Methods of Statistics. Princeton: Princeton University Press, page 282 (Chapter 21. The two-dimensional case). ISBN 0-691-08004-6 (table of content)
[2] Sheskin, David J. (1997). Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, Fl: CRC Press.
[3] Liebetrau, Albert M. (1983). Measures of association. Newbury Park, CA: Sage Publications. Quantitative Applications in the Social Sciences Series No. 32. (pages 15–16)
[4] http://artax.karlin.mff.cuni.cz/r-help/library/lsr/html/cramersV.html
[5] Bergsma, Wicher (2013). "A bias correction for Cramér's V and Tschuprow's T". Journal of the Korean Statistical Society. 42 (3): 323–328. doi:10.1016/j.jkss.2012.10.002.
[6] Bartlett, Maurice S. (1937). "Properties of Sufficiency and Statistical Tests". Proceedings of the Royal Society of London. Series A. 160 (901): 268–282. JSTOR 96803.

External links

A Measure of Association for Nonparametric Statistics (Alan C. Acock and Gordon R. Stavig Page 1381 of 1381–1386)
Nominal Association: Phi and Cramer's V ^{[dead link]} from the homepage of Pat Dattalo.


@@ Line 3: / Line 3: @@
 ==Usage and interpretation==
-φ<sub>''c''</sub> is the intercorrelation of two discrete variables<ref name="Ref_a">Sheskin, David J. (1997). Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, Fl: CRC Press.</ref> and may be used with variables having two or more levels.  φ<sub>''c''</sub> is a symmetrical measure, it does not matter which variable we place in the columns and which in the rows.  Also, the order of rows/columns doesn't matter, so φ<sub>''c''</sub> may be used with nominal data types or higher (ordered, numerical, etc.)
+φ<sub>''c''</sub> is the intercorrelation of two discrete variables<ref name="Ref_a">Sheskin, David J. (1997). Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, Fl: CRC Press.</ref> and may be used with variables having two or more levels.  φ<sub>''c''</sub> is a symmetrical measure, it does not matter which variable we place in the columns and which in the rows.  Also, the order of rows/columns doesn't matter, so φ<sub>''c''</sub> may be used with nominal data types or higher (notably ordered or numerical).
-Cramér's V may also be applied to [[goodness of fit]] chi-squared models when there is a 1&nbsp;× ''k'' table (e.g.: ''r''&nbsp;= 1). In this case ''k'' is taken as the number of optional outcomes and it functions as a measure of tendency towards a single outcome. {{citation needed|date=January 2016}}
+Cramér's V may also be applied to [[goodness of fit]] chi-squared models when there is a 1&nbsp;× ''k'' table (in this case ''r''&nbsp;= 1). In this case ''k'' is taken as the number of optional outcomes and it functions as a measure of tendency towards a single outcome. {{citation needed|date=January 2016}}
 Cramér's V varies from 0 (corresponding to [[Independence (probability theory)|no association]] between the variables) to 1 (complete association) and can reach 1 only when the two variables are equal to each other.