Cohen's Kappa for more than two annotators
with multiple classes
June 8th, 2010:
This tool is currently down for maintenance and addition of functionality, and will be updated and operational again soon. Until then, the old version is accessible here
Introduction
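Agreement can be measured as the percentage of cases on which the coders or annotators agree, but it is desirable to take the agreement expected by chance into account. The Kappa statistic proposed by Cohen [1,2] calculates chance agreement using individual coder marginals: Kappa = (PA - PE) / (1 - PE), where PA is the observed proportion of agreement and PE is the proportion of agreement expected by chance, obtained by multiplying the two coders' individual proportions for each class and summing over all classes.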
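As an illustration, the two-coder Kappa can be sketched in a few lines of Python. This is a minimal sketch, not the code behind this page; the function name cohen_kappa and the label lists are assumptions made for the example.

```python
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Return (PA, PE, kappa) for two equally long lists of class labels."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must annotate the same non-empty set of instances")
    n = len(coder_a)
    # Observed agreement: fraction of instances that received identical labels.
    pa = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's own marginal proportions,
    # summed over classes (a Counter returns 0 for classes a coder never used).
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    pe = sum((freq_a[k] / n) * (freq_b[k] / n) for k in freq_a)
    # Kappa is undefined when chance agreement is already 1.
    kappa = (pa - pe) / (1 - pe) if pe < 1 else float("nan")
    return pa, pe, kappa


# Hypothetical labels for three instances.
pa, pe, kappa = cohen_kappa(["red", "blue", "yellow"], ["orange", "blue", "orange"])
print(f"PA={pa:.3f} PE={pe:.3f} kappa={kappa:.3f}")
```

For these labels the coders agree on one instance out of three (PA = 1/3), only "blue" is used by both (PE = 1/9), and Kappa is 0.25.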
Cohen's Kappa, like Scott's pi [3], addresses only the case of two coders. For multiple coders, a Kappa statistic has been proposed [4,5] that is essentially a generalization of Scott's pi rather than of Cohen's Kappa, because the distributional characteristics of the codings of each individual coder are averaged away. The kappa for multiple coders that can be calculated on this page is a generalization of Cohen's Kappa according to Krippendorff [6].
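To make the difference concrete, the sketch below computes Scott's pi for two coders. Chance agreement is taken from one pooled label distribution, so the individual coder's distribution no longer plays a role. The function name scott_pi and the labels are assumptions for illustration only.

```python
from collections import Counter


def scott_pi(coder_a, coder_b):
    """Return (PA, PE, pi) for two equally long lists of class labels."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must annotate the same non-empty set of instances")
    n = len(coder_a)
    # Observed agreement is computed exactly as for Cohen's Kappa.
    pa = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: pool both coders' labels into one distribution
    # and sum the squared proportions.
    pooled = Counter(coder_a) + Counter(coder_b)
    pe = sum((count / (2 * n)) ** 2 for count in pooled.values())
    # Pi is undefined when chance agreement is already 1.
    pi = (pa - pe) / (1 - pe) if pe < 1 else float("nan")
    return pa, pe, pi


# Hypothetical labels for three instances.
pa, pe, pi = scott_pi(["red", "blue", "yellow"], ["orange", "blue", "orange"])
print(f"PA={pa:.3f} PE={pe:.3f} pi={pi:.3f}")
```

With these labels the pooled chance agreement is 10/36, higher than the 1/9 obtained from the individual coder marginals, so pi (1/13) comes out lower than Cohen's Kappa (0.25) for the same data.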
On this page, Cohen's kappa can be calculated for the case of more than two annotators (at most 9) with multiple classes.
Input format
In order to calculate the Kappa statistics, the annotations of each coder should be specified in a separate file, one file per coder. The files should be tab-separated text, and each file should be named after the coder and have the .txt extension. For example:
color	shape
red	round
blue	round
yellow	oval

color	shape
orange	round
blue	round
orange	oval

color	shape
red	round
purple	round
yellow	oval
In the files, the first line should contain the class labels; the other lines should contain the instances annotated. Note that the first line with the class labels must be identical in each file.
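A file in this format could be read as sketched below. This is an assumption-laden illustration rather than the parser used by this page: the names parse_annotations and check_headers are hypothetical, and the file content is passed in as a string so that the example runs without any files on disk.

```python
import csv
import io


def parse_annotations(text):
    """Parse one coder's tab-separated annotations.

    Returns (header, columns): header is the list of class labels from the
    first line, columns maps each label to the values annotated under it.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    if not rows:
        raise ValueError("empty annotation file")
    header, instances = rows[0], rows[1:]
    columns = {name: [] for name in header}
    for row_no, row in enumerate(instances, start=1):
        if len(row) != len(header):
            raise ValueError(f"instance {row_no}: expected {len(header)} fields, got {len(row)}")
        for name, value in zip(header, row):
            columns[name].append(value)
    return header, columns


def check_headers(headers_by_coder):
    """Raise ValueError unless every coder's first line is identical."""
    distinct = {tuple(header) for header in headers_by_coder.values()}
    if len(distinct) > 1:
        raise ValueError("the first line with the class labels differs between files")


# Content of one hypothetical coder file.
example = "color\tshape\nred\tround\nblue\tround\nyellow\toval\n"
header, columns = parse_annotations(example)
check_headers({"coder1": header, "coder2": ["color", "shape"]})
print(header, columns)
```

To read real files, the string could come from pathlib.Path(name).read_text(), with the coder name taken from the file name without its .txt extension.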
Output and results
The output of this program is a table that gives, for each annotator pair, the observed proportion of agreement (PA), the proportion of agreement expected by chance (PE), and the Kappa value. The last column gives the Kappa value per variable and the number of pairs that have been included in the calculation. For example, the output for the example files in the previous section (john.txt, joe.txt, and mary.txt) would look as follows:
| Agreement scores |
| Variable | joe+john | joe+mary | john+mary | AVG |
| color | | | | |
| shape | | | | |
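The pairwise scores in such a table could be computed along the following lines. This is a sketch under stated assumptions, not the implementation behind this page: which example file belongs to which coder is a guess, the names cohen_kappa, agreement_table, and average_kappa are hypothetical, and AVG is assumed here to be the unweighted mean of the pairwise Kappa values.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(coder_a, coder_b):
    """Return (PA, PE, kappa) for two equally long lists of class labels."""
    n = len(coder_a)
    pa = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    pe = sum((freq_a[k] / n) * (freq_b[k] / n) for k in freq_a)
    kappa = (pa - pe) / (1 - pe) if pe < 1 else float("nan")
    return pa, pe, kappa


def agreement_table(annotations):
    """annotations: {coder: {variable: [labels]}}.

    Returns {variable: {"a+b": (PA, PE, kappa)}} for every coder pair.
    """
    coders = sorted(annotations)
    table = {}
    for variable in annotations[coders[0]]:
        table[variable] = {
            f"{a}+{b}": cohen_kappa(annotations[a][variable], annotations[b][variable])
            for a, b in combinations(coders, 2)
        }
    return table


def average_kappa(pair_scores):
    """Unweighted mean of the pairwise kappas (an assumption about AVG)."""
    kappas = [kappa for _, _, kappa in pair_scores.values()]
    return sum(kappas) / len(kappas)


# Assumed assignment of the three example files to coders.
annotations = {
    "john": {"color": ["red", "blue", "yellow"], "shape": ["round", "round", "oval"]},
    "joe": {"color": ["orange", "blue", "orange"], "shape": ["round", "round", "oval"]},
    "mary": {"color": ["red", "purple", "yellow"], "shape": ["round", "round", "oval"]},
}
table = agreement_table(annotations)
for variable, pair_scores in table.items():
    cells = "  ".join(f"{pair}: {k:.2f}" for pair, (_, _, k) in pair_scores.items())
    print(f"{variable}  {cells}  AVG: {average_kappa(pair_scores):.2f}")
```

Sorting the coder names yields the pairs joe+john, joe+mary, and john+mary, the same column order as in the table above.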
Calculating Cohen's Kappa
Please specify two or more files (one for each coder) in the format described above. For each possible coder pair, the percentage agreement, expected agreement, and kappa score are calculated, and a kappa per variable is given.
After you have specified the data files and clicked the 'Calculate Kappa' button, a new window opens with the agreement scores. This application demo is limited to 2 MB of data files. If you have any questions or comments, please do not hesitate to contact me.
References
| [1] | Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46. |
| [2] | Jean Carletta. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254. |
| [3] | W. A. Scott. 1955. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19:127–141. |
| [4] | J. L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76:378–382. |
| [5] | Sidney Siegel and John N. Castellan. 1988. Nonparametric Statistics. 2nd ed. McGraw-Hill. |
| [6] | Klaus Krippendorff. 1980. Content Analysis: An Introduction to its Methodology. Sage Publications, Beverly Hills, CA, USA. |
