Talk:Cohen's kappa: Difference between revisions
== Approximate confidence interval is very wrong ==

In the section "Hypothesis testing and confidence interval" the following formula is given: <math>SE_{\kappa} = \sqrt{{p_o (1- p_o)} \over { N (1- p_e) ^2 }}</math>

The warnings given about this formula are far from enough to make it correct:

<blockquote>
This is calculated by ignoring that {{mvar|p<sub>e</sub>}} is estimated from the data, and by treating {{mvar|p<sub>o</sub>}} as an estimated probability of a binomial distribution while using asymptotic normality (i.e.: assuming that the number of items is large and that {{mvar|p<sub>o</sub>}} is not close to either 0 or 1).
</blockquote>

However, I can find a very simple case where it is horrendously wrong even though {{mvar|p<sub>e</sub>}} and {{mvar|p<sub>o</sub>}} are both equal to 50% and the sample is very large (smallest cell >= 100).

The sample is described by this contingency table:

{| class="wikitable"
| 1000
| 100
|-
| 1000
| 100
|}

Cohen's kappa is equal to 0, with {{mvar|p<sub>o</sub>}}=0.5 and {{mvar|p<sub>e</sub>}}=0.5.

An unbiased estimator (bootstrap) finds a standard error equal to 0.012. The asymptotic variance estimator of the irrCAC R package finds the same standard error (0.012).

The incorrect variance estimator cited by Wikipedia finds a standard error of 0.021.

This can be explained by the fact that the variance of {{mvar|p<sub>e</sub>}} is neglected, as is the covariance between {{mvar|p<sub>e</sub>}} and {{mvar|p<sub>o</sub>}}, whereas the variance of <math>p_o - p_e</math> (the numerator of Cohen's kappa) depends heavily on both.

A simple Monte Carlo simulation shows that the standard deviations of {{mvar|p<sub>o</sub>}} and {{mvar|p<sub>e</sub>}} are respectively equal to 0.011 and 0.0087, with a Pearson's correlation coefficient of 0.81 between the two variables. Consequently, the standard error of <math>p_o - p_e</math> is actually 0.0061 but is incorrectly approximated as the standard deviation of {{mvar|p<sub>o</sub>}} (0.011) by Wikipedia's formula.
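The arithmetic behind these figures can be checked with a short sketch. It is written in Python purely for illustration, the helper name `sd_of_difference` is hypothetical, and the inputs are the rounded values quoted above, so the result agrees with the quoted 0.0061 only up to rounding. It also assumes the first-order (delta-method) approximation, which for kappa = 0 reduces to dividing the standard deviation of the numerator by 1 - pe.

```python
import math


def sd_of_difference(sd_a, sd_b, corr):
    """Standard deviation of A - B, from Var(A) + Var(B) - 2*Cov(A, B)."""
    return math.sqrt(sd_a ** 2 + sd_b ** 2 - 2 * corr * sd_a * sd_b)


# Rounded values quoted in the comment above (assumed inputs, not recomputed here).
sd_po, sd_pe, rho = 0.011, 0.0087, 0.81
n, po, pe = 2200, 0.5, 0.5

sd_diff = sd_of_difference(sd_po, sd_pe, rho)
# First-order approximation of SE(kappa) when kappa = 0: SD(po - pe) / (1 - pe).
se_kappa = sd_diff / (1 - pe)
# Standard error from the formula currently given in the article.
se_article = math.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))

print(f"SD(po - pe)  ~ {sd_diff:.4f}")
print(f"SE(kappa)    ~ {se_kappa:.4f}")
print(f"article's SE ~ {se_article:.4f}")
```

With these rounded inputs the approximate standard error of kappa comes out close to the bootstrap value, while the article's formula gives a figure well above it.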

The following code in the R programming language explains the behaviour:

<syntaxhighlight lang="R">
# Cohen's kappa, observed agreement (po) and chance agreement (pe)
# computed from a contingency table of counts
ckappa.table=function(tbl) {
  tbl=prop.table(tbl)
  expected = colSums(tbl) %o% rowSums(tbl)
  pe=sum(diag(expected))
  po=sum(diag(tbl))
  c(kappa=(po-pe)/(1-pe), po=po, pe=pe)
}

# Variance of kappa according to the formula currently given in the article
ckappa.var.wikipedia=function(tbl) {
  ck=ckappa.table(tbl)
  po=ck["po"]
  pe=ck["pe"]
  N=sum(tbl)
  po*(1-po)/(N*(1-pe)^2)
}

tabl=rbind(c(1000,100),c(1000,100))
ckappa.table(tabl)

set.seed(2024)
# Expand the table into raw ratings: one row per item, one column per rater
mat=cbind(rater1=rep(c(0,1,0,1), tabl), rater2=rep(c(0,0,1,1), tabl))

# Asymptotic standard error from the irrCAC package
irrCAC::conger.kappa.raw(mat)$est[1,"coeff.se"]
# Standard error from the article's formula
sqrt(ckappa.var.wikipedia(tabl))

# Bootstrap of kappa, po and pe
bt=boot::boot(data=mat, function(data, idx) {
  ckappa.table(table(data[idx,1], data[idx,2]))
},R=1000)
colnames(bt$t)=names(bt$t0)
apply(bt$t, 2, sd)             # bootstrap standard errors of kappa, po, pe
cor(bt$t)                      # correlations between kappa, po, pe
sd(bt$t[,"po"])                # what the article's formula approximates
sd(bt$t[,"po"] - bt$t[,"pe"])  # actual variability of the numerator
</syntaxhighlight>

The error is even worse for more unbalanced tables. Therefore, I think that the formula <math>SE_{\kappa} = \sqrt{{p_o (1- p_o)} \over { N (1- p_e) ^2 }}</math> is too wrong to be worth mentioning on Wikipedia. Conditioning on {{mvar|p<sub>e</sub>}} was a bad idea, because the variances and covariance of {{mvar|p<sub>o</sub>}} and {{mvar|p<sub>e</sub>}} can be high.

The article cited by Wikipedia for this standard error calculation (https://pmc.ncbi.nlm.nih.gov/articles/PMC3900052/) is correctly cited but provides a wrong formula in my opinion. The article does not provide any rationale or theoretical explanation for this formula and does not cite any other article that would provide an explanation. Since it is a very general pedagogical article, the formula was probably invented by the article's author and never thoroughly tested.

I propose to cite a much better article, providing an asymptotically correct formula: Fleiss J, Cohen J. Large sample standard errors of kappa and weighted kappa. Psychological Bulletin 1969, Vol. 72, No. 5, 323-327. [[Special:Contributions/37.165.113.230|37.165.113.230]] ([[User talk:37.165.113.230|talk]]) 00:05, 28 December 2024 (UTC)
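For comparison, here is a minimal sketch of the large-sample standard error as it is commonly stated in the literature following the 1969 paper proposed above. It is written in Python for illustration only; the function names are hypothetical, and the formula is transcribed as I understand the usual statement, so it should be checked against the paper before anything is added to the article.

```python
import math


def kappa_parts(table):
    """Return (cell proportions, row marginals, column marginals, po, pe)."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p = [[cell / n for cell in row] for row in table]
    rows = [sum(p[i]) for i in range(k)]
    cols = [sum(p[i][j] for i in range(k)) for j in range(k)]
    po = sum(p[i][i] for i in range(k))
    pe = sum(rows[i] * cols[i] for i in range(k))
    return p, rows, cols, po, pe


def se_article(table):
    """Standard error from the formula currently given in the article."""
    n = sum(sum(row) for row in table)
    _, _, _, po, pe = kappa_parts(table)
    return math.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))


def se_large_sample(table):
    """Large-sample standard error (assumed transcription of the 1969 formula)."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p, rows, cols, po, pe = kappa_parts(table)
    # Contribution of the agreement (diagonal) cells.
    diag = sum(
        p[i][i] * ((1 - pe) - (rows[i] + cols[i]) * (1 - po)) ** 2
        for i in range(k)
    )
    # Contribution of the disagreement (off-diagonal) cells.
    off = (1 - po) ** 2 * sum(
        p[i][j] * (cols[i] + rows[j]) ** 2
        for i in range(k)
        for j in range(k)
        if i != j
    )
    const = (po * pe - 2 * pe + po) ** 2
    return math.sqrt((diag + off - const) / (n * (1 - pe) ** 4))


table = [[1000, 100], [1000, 100]]
print(f"article's formula: {se_article(table):.4f}")
print(f"large-sample SE:   {se_large_sample(table):.4f}")
```

On the table from the comment, the large-sample value rounds to the same 0.012 reported for the bootstrap and irrCAC, while the article's formula gives about 0.021.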
Latest revision as of 00:05, 28 December 2024
This article is rated C-class on Wikipedia's content assessment scale.
|
What about unifying this with http://en.wikipedia.org/wiki/Fleiss%27_kappa . It seems to be the same quantity... —Preceding unsigned comment added by 72.70.76.11 (talk) 23:11, 18 January 2010 (UTC)
Excellent sweep work and getting that formula in order. Cheers. --Piewalker 21:49, 25 Apr 2005 (UTC)
Dispute with AbsolutDan
I found this message on my Talk page: "Please stop. If you continue to use Wikipedia for advertising, as you did in Cohen's kappa, you will be blocked from editing. A link to the edit I have reverted can be found here: link. If you believe this edit should not have been reverted, please contact me. --AbsolutDan (talk) 12:49, 4 August 2006 (UTC)" Can someone provide an opinion on the suitability of the link to Cohen's Kappa Example: http://www.6sigma101.com/glossary/cohen_kappa.htm#1 What is wrong with this example? In what way is this advertising? —The preceding unsigned comment was added by 203.214.51.192 (talk • contribs).
- The formula looks fine at that link. Advertising or not, the 2x2 tables are helpful. It's the correct formula. Leaving that page, to be honest I don't feel an overwhelming, compulsory, capitalistic urge to buy anything. In fact, after visiting that page I want to buy less stuff and use the formula to gauge interrater agreement between other people who also went there and want to buy less. Piewalker 15:10, 4 August 2006 (UTC)
Thanks, Piewalker. Do you support re-adding the link?
- Sure. We can always remove it later if the forces of good or evil collectively decide this one hyperlink sucks. Piewalker 16:00, 4 August 2006 (UTC)
- That's opposite of how it works, the inclusion of links has to be justified, not their removal. You need to explain first, what makes the link a unique resource to the encyclopedic content that isn't already in the article or could be easily included? Femto 18:04, 4 August 2006 (UTC)
- I guess you're right, Femto. Even though I already justified inclusion of the link with my first post in sentence 2, 3, and 4, I suppose it's not necessarily a "unique" resource. One could get this data from many places. I still think 2x2 tables are excellent examples that are easily modifiable for adaptation to the article (so it's not plagiarism). Thanks for volunteering yourself to do that, "Male, European, and already paranoid about giving away this much information." Holler if you need help. Cheerio. Piewalker 18:42, 4 August 2006 (UTC)
- I'm not volunteering, 203.214.51.192 should. Instead of keeping to re-include contentious external links, add truly free internal content to Wikipedia. Femto 18:54, 4 August 2006 (UTC)
Dispute with Femto
So that you can remove it at your whim? It is very clear that Wikipedia is about spam wars by self-confessed vigilantes like you and AbsolutDan, and not about supporting genuine contributions. As you know yourself, you violate Wikipedia rules by removing content supported by another editor who has proven that the advertising point made by AbsolutDan is simply not true. As you can see from this discussion record, the other external link here never had to be justified (I am not suggesting to remove it though). This is a clear example of your selective, vindictive approach. You made it "contentious" and totally ignored its content. What is "truly free internal content"? How is http://www.6sigma101.com/glossary/cohen_kappa.htm#1 not free? Is it because you cannot see it in Opera? —The preceding unsigned comment was added by 203.214.51.192 (talk • contribs) 21:29, 4 August 2006 (UTC)
- Reply copied from User talk:Femto - Contribute cited text, not bare links. External links are not content. The page is Copyright ©2006 MiC Quality, under exclusive control of that company, and not part of Wikipedia's GNU Free Documentation License. I'm not "removing professional content from Wikipedia", you never added it. You showed no other interest in improving Wikipedia, only in placing links to the company you're associated with. This is advertising, and your insistent inclusion of links is spamming. Femto 12:40, 5 August 2006 (UTC)
As I've mentioned in a couple places already, 6sigma101.com has been spammed across multiple articles by several usernames and IPs which appear to be working in concert. Please see the following:
- Special:Contributions/203.214.69.7
- Special:Contributions/Goskan
- Special:Contributions/Glen netherwood
There was a heated discussion about the link here: Talk:Six Sigma where it was determined the link was not extremely helpful and is to a site that is intended to promote their services (6sigma training). If you do a WHOIS on 203.214.69.7 ([1]) and 203.214.51.192 ([2]), you can see they both come from the same ISP. It seems apparent that this is simply the new IP of the above contributor(s), back to try to include links to their site, in blatant violation of guidelines. Furthermore, removing the link is not against guidelines unless there is a consensus in favor of the link. One agreement does not a consensus make, especially when there are 2 people not in favor of the link. Besides, the one person who did initially agree with the link later decided against it. Please stop pushing the 6sigma101 link and add useful content to Wikipedia. --AbsolutDan (talk) 21:41, 4 August 2006 (UTC)
- As an outside observer, I viewed the link to 6sigma101 and found the example to be not only a good one but lacking of any particular "buy our services" content that you describe. While that material may be elsewhere on the site and other links by the users described may better advertise for their services, it is not even a secondary purpose for the link provided on this page. ju66l3r 21:46, 4 August 2006 (UTC)
- Thank you. Your comment is like a breath of fresh air. I hope there will be more professional people like you supporting useful contributions from other professional people. —The preceding unsigned comment was added by 203.214.51.192 (talk • contribs) .
- As given for the rationale of my last revert, linkspam does not have to be commercial to be promotional. Professional people are still invited to contribute content instead of links. Femto 13:18, 5 August 2006 (UTC)
Proposal to Re-enter Link to The Example
As before I would like to add the link to the example on Cohen's Kappa. If you support it please confirm it here. TIA
- It's probably best to not relink it, especially considering the evidence presented above. I proposed earlier that you adapt the 2x2 tables and then integrate them into the article instead of hyperlinking to a proprietary page. Adapt the content, it's not that hard. True it's a little bit more work, but that's what you're gonna have to do in this case, dude. Oh, and you probably should register at Wikipedia and get yourself a user name if you want to be taken seriously here. Cheerio. Piewalker 22:32, 4 August 2006 (UTC)
I must say that the link gives good information. I must concede too that I did not regard it as unduly "promotional". The argument of Femto, however, is not to be dismissed easily. Thus, incorporation of the illustration sample in the article would be a benefit. —The preceding unsigned comment was added by 217.194.34.103 (talk • contribs) .
- The owner of a site was pushing to add links to it, if that's not promotional, what is it? Adding true internal content would be fine. Femto 16:01, 6 October 2006 (UTC)
- I agree with Femto that the additions of "true internal content would be fine." Absolutely. Sure, there's some information at that site, but it's interspersed with promotional material. Adapt the concepts of the content there in your own way! Ideas cannot be copyrighted. The Associated Press Stylebook and Libel Manual (32nd edition, 1997) notes the following on page 303: "The broadest limitation on the reach of copyright law is that ideas and facts are never protected by a copyright." Adapt it. Don't link to it. There's no compelling reason to link to a page. I agree, an illustration of the concepts presented in that particular link would be helpful in this article, but those concepts can be found at dozens of other web sites as well, not to mention text books and a litany of peer-reviewed journal articles that have used this in practice. In this case, you must do the work to adapt the content, or collaborate with others to aptly communicate the principles of Kappa. What we're after here at Wikipedia is the truth, not to advertise. I'm against linking. I'm not against representing truth. So, anonymous user, again I respectfully request that you register at Wikipedia and sign your comments. Above all, I appreciate that you're trying to make this article better. Piewalker 16:19, 6 October 2006 (UTC)
And our namesake?
Would anyone like to create a brief article about Jacob Cohen? Previously, the name linked to Rodney Dangerfield, as Jacob Cohen was apparently his given name. I added a link here to that page so that if someone does finally make a Jacob Cohen (measurement researcher) page, it will be properly linked, assuming the author takes steps to disambiguate it from the actor. —Preceding unsigned comment added by Carinamc (talk • contribs) 04:31, 2 February 2008 (UTC)
Not quite sure about the 2x2 example used
I thought this article well written and quite informative. But I had a problem with the example used. A 2x2 matrix was used, and it showed the percentage of YES/NO ratings for each rater. But I simply don't see how one can compute the relative agreement between the two raters in this case. For example, if two reviewers of student essays are rating the same 50 essays, I could not tell how much they agreed if both rated 50% of the essays GOOD and 50% of the essays BAD. Instead, one would need to know, for each document, whether there was agreement or not between the two raters. So I kind of puzzled over this, until I decided just to ignore the example.
So personally I'd like the example made a bit more compelling. —Preceding unsigned comment added by Terralinda (talk • contribs) 23:18, 11 September 2009 (UTC)
The table has been more readily explained.Farnberg (talk) 17:33, 19 November 2009 (UTC)
- Well, I am quite sure that the example is nonsense. "Does not always produce the expected answer" ... you can say that about every method. Therefore I have removed it and placed it here. (Being bold, but hopefully not too destructive on the efforts of others.) 78.55.34.82 (talk) 15:18, 3 February 2010 (UTC)
Inconsistent results
One of the problems with Cohen's Kappa is that it does not always produce the expected answer[1]. For instance, in the following two cases there is much greater agreement between A and B in the first case[why?] than in the second case and we would expect the relative values of Cohen's Kappa to reflect this. However, calculating Cohen's Kappa for each:
    | Yes | No |
Yes | 45  | 15 |
No  | 25  | 15 |

    | Yes | No |
Yes | 25  | 35 |
No  | 5   | 35 |
we find that it shows greater similarity between A and B in the second case, compared to the first.
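The two kappa values referred to above can be reproduced with a short sketch. It is in Python for illustration only; the function name `cohen_kappa` and the nested-list layout (rows for one rater, columns for the other) are assumptions made for the example.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square contingency table given as nested lists of counts."""
    n = sum(sum(row) for row in table)
    k = len(table)
    # Observed agreement: share of items on the diagonal.
    p_o = sum(table[i][i] for i in range(k)) / n
    # Marginal proportions of each category for the two raters.
    row = [sum(table[i]) / n for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Agreement expected by chance from the marginals.
    p_e = sum(row[i] * col[i] for i in range(k))
    return (p_o - p_e) / (1 - p_e)


first = [[45, 15], [25, 15]]
second = [[25, 35], [5, 35]]

print(f"first case:  kappa = {cohen_kappa(first):.4f}")
print(f"second case: kappa = {cohen_kappa(second):.4f}")
```

Both tables have the same observed agreement (60 of 100 items), but the chance agreement differs (0.54 against 0.46), which is why the second case ends up with the higher kappa, roughly 0.26 against 0.13.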
take out the Inconsistent Results section !!!!!
Think of a 2x2 results grid like
True Positive || False Positive
True Negative || False Negative
If people "agree" on the TP and the FN, cool. Great. So what? (TP + FP) = what exactly ?
Cohen's k is inconsistent only if one thinks that TP and FP inextricably exist in the same category and should behave in perfect positive correlation one to another. Because they don't, obviously, kappa is not an inconsistent measure. One simply must understand exactly WHAT is being calculated. Does this make any sense? I wish someone would fix this. My wiki syntax control is still weak as you can see. —Preceding unsigned comment added by 130.234.5.137 (talk) 22:14, 20 December 2010 (UTC)
Well the scenario you pose is a little different. If we know the true value, we are looking at agreement with a standard (a.k.a. validity), not with another rater. (As you say, just because raters agree with each other doesn't make them right...) So I could look at Bob's agreement to the standard and Mary's agreement to the standard individually using kappa. One or the other could have better agreement above that of chance. But if we are to compare the system of "Bob and Mary" and its overall ability to agree with the standard, I would suggest using R.J. Light's G, since the standard is the real value it has special significance. G kind of generates a pooled agreement with the standard across all the individual assessors.128.138.77.197 (talk) 16:00, 19 May 2017 (UTC)
Inconsistent results...are they?
It's not at all clear to me (either from the Wikipedia page or from the cited paper) why the results listed in the "Inconsistent results" section are inconsistent. The article doesn't explain why, and the source paper merely claims "Most researchers would prefer in such situations, an agreement statistic that would yield a higher inter-rater reliability for [the first] experiment." First off, I'm pretty sure the quoted claim would be tagged with "who?" were it to be posted on Wikipedia. Second, the way I see it, the raters in the two experiments agree on the same number of items and disagree on the same number of items. Why should the agreement statistics differ? Can anyone explain this? Cheers! Matt Gerber (talk) 04:08, 9 December 2009 (UTC)
- In the two data sets the total number of items on which the raters agree is the same and the total number of items on which they disagree is the same. The difference is that the disagreement numbers are heavily skewed towards one of the possible disagreement cases. Does this mean there is greater agreement for one set of results compared to the other? I don't know. I would agree that the two sets of data should have the same 'similarity' value. Derek farn (talk) 12:54, 9 December 2009 (UTC)
- In which case the example still shows something inconsistent about Cohen's Kappa, but for a reason other than the one presumed by the writers of the article/paper, who believe the first experiment should show greater agreement (still not sure why). I've surveyed some literature recently, and found what I think is a much more intuitive case of inconsistency. Check it out here, on page 99, examples 3 and 4. The only difference between the two is that the raters agreed in different ways. Examples 3 and 4 in the paper should intuitively show very high agreement, and example 4 does (k=0.8) - but the other shows negative agreement (i.e., disagreement) at k=-0.048. Does anyone object to replacing the current example with the one I've linked to? Matt Gerber (talk) 15:50, 9 December 2009 (UTC)
- Fine by me. Derek farn (talk) 17:54, 9 December 2009 (UTC)
Significance
The part about significance doesn't actually say anything about significance. Stata gives me some p-value, though, so there must be some way? --Sineuve (talk) 07:01, 16 June 2010 (UTC)
Significance and Magnitude
The section currently labeled Significance is mainly about magnitude. I changed its name to Significance and Magnitude and added a sentence about significance (with citations), explaining why significance is rarely reported. I added information about magnitude (with citations). Specifically, I note three factors that can affect the magnitude of kappa (bias, prevalence, number of codes) and make the point that, given these factors, no particular magnitude of kappa can be regarded as universally acceptable.
Nonetheless, I left the current prose that gives Landis and Koch’s and Fleiss’s guidelines for magnitude. I put Koch and Landis’s guidelines in prose, not in a table as currently, because the table gives them a prominence not in line with the prose that describes them as problematic. I also edited the current prose some. Current prose is, “It has been noted that these guidelines may be more harmful than helpful” and continues “as the number of categories and subjects will affect the magnitude of the value. The kappa will be higher when there are fewer categories.” I left the “more harmful than helpful” but I deleted the phrase and sentence in the last quote because it is inaccurate.
I was puzzled by the statement that kappa will be higher with fewer categories, when I know from reading and my own work that the opposite is true. The current prose cites Sim and Wright (2005) for this statement. True, Sim and Wright do say this, but it is not a major point in their otherwise thorough and careful article. Their citation for the statement is: Maclure, M., & Willett, W. C. (1987). Misinterpretation and misuse of the kappa statistic. American Journal of Epidemiology, 126, 161–169.
As evidence for the statement, Maclure and Willett present a single 12x12 table and compute its kappa. They then collapse the table into 6x6, 4x4, 3x3, and 2x2 tables. For each successive table, the value of kappa does increase, but this is due to the nature of the data in the initial table. Maclure and Willett knew this. They wrote, “Clearly, in this situation, the values for Kappa are so greatly influenced by the number of categories that a four-category-Kappa for ordinal data cannot be compared with a three-category Kappa” (p. 163). Thus Maclure and Willett did not claim that kappa would be lower with more codes generally, but only in the situation where ordinal codes are collapsed. Additionally, collapsing doesn’t tell us what observers of a given accuracy would do when choosing among 12, 6, etc. codes. The simulation study cited in my modified prose provides evidence for the statement that, other things being equal and for nominal as well as ordinal codes, kappa decreases when codes are fewer (i.e., when coders of a specified accuracy chose one of a specified number of codes).
RABatl (talk) 20:21, 29 June 2010 (UTC)
Weighted Kappa
Weighted kappa can be used in cases where you can order the labels. The sentence "especially useful when codes are ordered" is not very clear. It should give an example saying that if you have label1<label2<label3 in a "semantic way", you can use weighted kappa. Furthermore, I think the weight matrix should be better explained. It has two forms: linear or quadratic. In the linear case, w_ij=1-(|i-j|/max_d), where max_d=max(|i-j|) for all i and j in the label domain. If your label domain is 1,2,...n, then max_d = n-1.
In the quadratic case, w_ij=1-(|i-j|/max_d)^2.
I needed to use weighted kappa for a small piece of work, but could not find any code. I think looking at the code can better help people understand how you actually calculate the value. I wrote an implementation in PHP. I added it to the external links, but the bot tends to delete it. If the link needs to be deleted, please copy the code into the article, but definitely keep it.
love you guys, keep up the good work — Preceding unsigned comment added by 193.206.170.151 (talk) 13:48, 1 June 2011 (UTC)
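The linear and quadratic weights described in the comment above can be sketched as follows. This is a Python illustration, not the PHP implementation mentioned there; the function names are hypothetical, and the weighted kappa is computed with the usual agreement-weight form, (observed - expected) / (1 - expected), which is an assumption about what that implementation does.

```python
def agreement_weights(n, kind="linear"):
    """Weights w_ij = 1 - (|i-j| / max_d)**power for n ordered labels (n >= 2)."""
    if kind not in ("linear", "quadratic"):
        raise ValueError("kind must be 'linear' or 'quadratic'")
    max_d = n - 1
    power = 1 if kind == "linear" else 2
    return [[1 - (abs(i - j) / max_d) ** power for j in range(n)] for i in range(n)]


def weighted_kappa(table, weights):
    """Weighted kappa for a square table of counts and a matrix of agreement weights."""
    k = len(table)
    total = sum(sum(row) for row in table)
    p = [[cell / total for cell in row] for row in table]
    rows = [sum(p[i]) for i in range(k)]
    cols = [sum(p[i][j] for i in range(k)) for j in range(k)]
    # Weighted observed agreement and weighted chance agreement.
    observed = sum(weights[i][j] * p[i][j] for i in range(k) for j in range(k))
    expected = sum(weights[i][j] * rows[i] * cols[j] for i in range(k) for j in range(k))
    return (observed - expected) / (1 - expected)


# Made-up ratings on three ordered labels, for illustration only.
ratings = [[10, 3, 1], [2, 8, 4], [0, 3, 9]]
print(agreement_weights(3, "linear"))
print(agreement_weights(3, "quadratic"))
print(weighted_kappa(ratings, agreement_weights(3, "linear")))
print(weighted_kappa(ratings, agreement_weights(3, "quadratic")))
```

With only two labels the linear weight matrix is the identity, so the weighted kappa reduces to the ordinary Cohen's kappa.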
Advice
The last link in Online Calculators section does not refer to an online calculator. Please move it to External Links section. — Preceding unsigned comment added by 89.96.28.120 (talk) 10:20, 30 November 2011 (UTC)
Inconsistent results (again)
This has been brought up before, but I think the "Inconsistent results" section is misleading. The fact that kappa can differ as percent agreement stays the same is a feature, not a bug. Otherwise, we would simply use percent agreement as the sole measure of interrater agreement. In the example cited, the reason the kappa is greater in the second case is that the two raters have managed to obtain the same agreement rate despite the fact that one of them has many more "Yes" results than the other. Honestly, I would rather a section with those examples read, "One of the advantages of Cohen's kappa is that it can distinguish between two sets of raters with the same agreement rate."
There are, of course, some valid criticisms of Cohen's kappa. I just don't think this is one of them. I see that it has been like this for quite some time, though, so perhaps I am misinterpreting. Thoughts? FCSundae ∨☃ (talk) 11:50, 8 December 2011 (UTC)
Inconsistent results (again and again)
This section is misleading. The results aren't 'inconsistent'. The prior distributions of answers aren't the same, so there's no reason to expect identical Cohen's kappas. I believe that the beginning of the article is great, and explains that this "inconsistency" is part of the statistics. Since there have been multiple remarks about this on this discussion page, I removed the section. — Preceding unsigned comment added by Gapar2 (talk • contribs) 01:06, 4 February 2012 (UTC)
Weighted Kappa formula incorrect?
The formula for the weighted kappa appears to be incorrect, but I wasn't certain enough to edit it. It appears as:
while in Cohen's 1968 paper (which we link to in the article), it appears as
Am I missing something? Zephyrus Tavvier (talk) 04:34, 10 May 2015 (UTC)
Carletta's kappa is not Cohen's kappa
I removed the reference
- Carletta, Jean. (1996) Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2), pp. 249–254.
because, apparently, what Carletta is arguing for is not Cohen's kappa:
The literature is full of terminological inconsistencies. Carletta calls the coefficient of agreement she argues for “kappa,” referring to Krippendorff (1980) and Siegel and Castellan (1988), and using Siegel and Castellan’s terminology and definitions. However, Siegel and Castellan’s statistic, which they call K, is actually Fleiss’s generalization to more than two coders of Scott’s π, not of the original Cohen’s κ; to confuse matters further, Siegel and Castellan use the Greek letter κ to indicate the parameter which is estimated by K.
QVVERTYVS (hm?) 14:16, 30 July 2015 (UTC)
First example seems to have an error
I believe the first "math" image in the example does not match the rest of the example. I refer to the formula below the text "The observed proportionate agreement is:", and which concludes p(o) ≈ .915. It appears as if the image is using different variables in the example. In the rest of the example, p(o) = .7.
I am not confident enough to edit this, though. Can someone else confirm? Thank you. 199.217.6.200 (talk) 16:09, 31 March 2017 (UTC)
External links modified
Hello fellow Wikipedians,
I have just modified one external link on Cohen's kappa. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
- Added archive https://web.archive.org/web/20140201193356/https://mlnl.net/jg/software/ira/ to https://mlnl.net/jg/software/ira/
When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.
This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}}
(last update: 5 June 2024).
- If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
- If you found an error with any archives or the URLs themselves, you can fix them with this tool.
Cheers.—InternetArchiveBot (Report bug) 07:40, 10 August 2017 (UTC)