This document contains statistical data on abstract evaluation for the APSA program. The current dataset is limited to the 418 abstracts submitted for the 2020 meeting; future iterations could be expanded to include data from previous years.
The dataset is stratified by individual rater, with one column per rater, so a ‘hawks’ and ‘doves’ analysis is now possible; that analysis is planned for the future.
The project was launched in an effort to evaluate, and possibly improve, overall inter-rater reliability for APSA abstract submissions.
| Total Ratings Submitted |
|---|
| 5978 |
| Score | Frequency |
|---|---|
| 1 | 87 |
| 2 | 150 |
| 3 | 426 |
| 4 | 845 |
| 5 | 1131 |
| 6 | 1348 |
| 7 | 1252 |
| 8 | 546 |
| 9 | 150 |
| 10 | 43 |
| Statistic | Value |
|---|---|
| Mean score | 5.637337 |
| Median score | 6 |
| Standard deviation | 1.716367 |
| Minimum score | 1 |
| Maximum score | 10 |
| Skewness of distribution | -0.2647954 |
| Kurtosis of distribution | -0.1189327 |
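For reference, the summary statistics above could be reproduced along the following lines. This is a minimal sketch, assuming the ratings live in a data frame `ratings` with one row per abstract and one column per rater (the object name is my assumption, and the exact skewness/kurtosis values can depend on the estimator used):

```r
# Minimal sketch (assumed layout: one row per abstract, one column per rater)
library(psych)

all_scores <- as.vector(as.matrix(ratings))    # flatten all rater columns into one vector
all_scores <- all_scores[!is.na(all_scores)]   # drop missing ratings (NA)

length(all_scores)      # total ratings submitted
table(all_scores)       # frequency of each score, 1-10
describe(all_scores)    # mean, median, sd, min, max, skew, kurtosis
```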
I removed the missing data (NA) for the plot.
| Program Committee Ratings Submitted |
|---|
| 5103 |
| Score | Frequency |
|---|---|
| 1 | 77 |
| 2 | 137 |
| 3 | 393 |
| 4 | 766 |
| 5 | 989 |
| 6 | 1140 |
| 7 | 1070 |
| 8 | 430 |
| 9 | 85 |
| 10 | 16 |
| Statistic | Value |
|---|---|
| Mean score | 5.532824 |
| Median score | 6 |
| Standard deviation | 1.680255 |
| Minimum score | 1 |
| Maximum score | 10 |
| Skewness of distribution | -0.3132788 |
| Kurtosis of distribution | -0.1990434 |
I removed the missing data (NA) for the plot.
Our dataset contains multiple abstracts and multiple raters. It is unbalanced and not fully crossed: the number of ratings per submission varies, and there are missing data because not every rater scored every abstract.
The intraclass correlation coefficient (ICC) therefore provides the best way to assess inter-rater reliability for this dataset. Because a different set of raters is “randomly” drawn from a larger population of raters for each submission, a one-way model is needed to calculate the ICC [1, 2].
I ran an analysis using the iccNA() function from the irrNA R package [3]. It calculates ICCs across all abstracts and all raters without removing missing data.
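The exact call is not shown here, but it was presumably something along these lines, using the same assumed `ratings` data frame (abstracts in rows, raters in columns):

```r
# Sketch of the iccNA call on the assumed ratings data frame
library(irrNA)

iccNA(ratings)   # one-way and two-way ICCs, computed without discarding missing ratings
```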
Here is the ICC analysis:
```
## $ICCs
##                ICC p-value lower CI limit upper CI limit
## ICC(1)   0.1749588       0      0.1497055      0.2039065
## ICC(k)   0.7520025       0      0.7157122      0.7855233
## ICC(A,1) 0.2106269       0      0.1831294      0.2418616
## ICC(A,k) 0.7923343       0      0.7620897      0.8202889
## ICC(C,1) 0.2222910       0      0.1938148      0.2545296
## ICC(C,k) 0.8034248       0      0.7746565      0.8299969
##
## $n_iter
## [1] 5
##
## $amk
## [1] 14.30144
##
## $k_0
## [1] 14.29921
```
As you can see, the single-rater reliability (ICC(1)) for the 418 abstracts is 17.5%, which is quite poor; high-reliability exams (e.g. the certifying exam of the American Board of Surgery) require inter-rater reliability above 80%. However, the averaged-rater reliability (ICC(k)) is much better at 75.2%. This might be the more relevant number in our case (see below).
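The single-rater and averaged-rater values are tied together by the Spearman-Brown relation, ICC(k) = k * ICC(1) / (1 + (k - 1) * ICC(1)). As a quick sanity check (not part of the original analysis), plugging in ICC(1) and the k_0 value from the iccNA output above reproduces the reported ICC(k):

```r
# Spearman-Brown check: reliability of the mean of k ratings per abstract
icc1 <- 0.1749588   # ICC(1) from the iccNA output above
k    <- 14.29921    # k_0 from the iccNA output above

k * icc1 / (1 + (k - 1) * icc1)   # ~0.752, matching the reported ICC(k)
```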
I then used the ICC() function from the psych package [4], which can also perform this analysis with missing data, using a slightly different algorithm.
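The results below were presumably produced by a call roughly like this; the argument choices are my assumptions (in psych::ICC, `lmer = TRUE` fits the variance components with lme4 so that incomplete rows do not have to be dropped):

```r
# Sketch of the psych::ICC call on the assumed ratings data frame
library(psych)

ICC(ratings, missing = FALSE, lmer = TRUE)   # keep incomplete rows; lme4 handles the unbalanced design
```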
Here are the results:
| | type | ICC | F | df1 | df2 | p | lower bound | upper bound |
|---|---|---|---|---|---|---|---|---|
| Single_raters_absolute | ICC1 | 0.1664209 | 14.57594 | 417 | 28006 | 0 | 0.1504504 | 0.1847605 |
| Single_random_raters | ICC2 | 0.1691943 | 20.03060 | 417 | 27939 | 0 | 0.1501878 | 0.1906060 |
| Single_fixed_raters | ICC3 | 0.2186656 | 20.03060 | 417 | 27939 | 0 | 0.1992761 | 0.2406691 |
| Average_raters_absolute | ICC1k | 0.9313938 | 14.57594 | 417 | 28006 | 0 | 0.9233271 | 0.9390655 |
| Average_random_raters | ICC2k | 0.9326520 | 20.03060 | 417 | 27939 | 0 | 0.9231814 | 0.9412230 |
| Average_fixed_raters | ICC3k | 0.9500764 | 20.03060 | 417 | 27939 | 0 | 0.9442062 | 0.9556591 |
The single-rater reliability is close to the previous method, at around 16.9% (ICC2); the average-rating reliability looks quite a bit better at 93.3% (ICC2k).
Our decisions on whether to accept or reject a submission are based mostly on the average score, so the average-measures ICC is likely the appropriate statistic in our case. Here is a direct quote that supports this:
> In studies where all subjects are coded by multiple raters and the average of their ratings is used for hypothesis testing, average-measures ICCs are appropriate. However, in studies where a subset of subjects is coded by multiple raters and the reliability of their ratings is meant to generalize to the subjects rated by one coder, a single-measures ICC must be used. Just as the average of multiple measurements tends to be more reliable than a single measurement, average-measures ICCs tend to be higher than single-measures ICCs. In cases where single-measures ICCs are low but average-measures ICCs are high, the researcher may report both ICCs to demonstrate this discrepancy. [1]
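The per-abstract averages that these decisions rest on are simply the row means of the rating matrix; a short sketch on the same assumed layout:

```r
# Per-abstract mean score, the quantity the accept/reject decisions are mostly based on
abstract_means <- rowMeans(ratings, na.rm = TRUE)
summary(abstract_means)
```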
There are plenty of other analyses that could be done with the collected data; I am more than happy to run more queries if you have other ideas about what to look at.
A few ideas:

- ICC per committee membership
- ICC per topic
- ICC for abstracts accepted for podium presentation, accepted for poster presentation, or rejected
- Dove and hawk calculations between raters (a rough sketch follows this list)
- Cross-referencing membership interests with abstract ratings (using clicks in the virtual meeting as a proxy)
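As a starting point for the dove/hawk item, per-rater leniency could be screened with something like the following rough sketch (again on the assumed `ratings` layout, not a finished analysis):

```r
# Rough 'hawks and doves' screen: each rater's mean score relative to the overall mean
rater_means <- colMeans(ratings, na.rm = TRUE)
overall     <- mean(as.matrix(ratings), na.rm = TRUE)

sort(rater_means - overall)   # strongly negative = 'hawk' (harsh); strongly positive = 'dove' (lenient)
```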
Pretty interesting stuff!
Andreas
1. Hallgren KA. Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial. Tutor Quant Methods Psychol. 2012;8(1):23–34. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032, accessed 5/11/2020.
2. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979 Mar;86(2):420–428. http://rokwa.x-y.net/Shrout-Fleiss-ICC.pdf, accessed 5/12/2020.
3. Brueckl M, Heuer F. Package ‘irrNA’. https://cran.r-project.org/web/packages/irrNA/irrNA.pdf, accessed 5/11/2020.
4. Revelle W. psych: Procedures for Psychological, Psychometric, and Personality Research. https://cran.r-project.org/web/packages/psych, accessed 5/12/2020.