
This document summarizes the statistical analysis of abstract evaluations for the APSA program. The current dataset is limited to the 418 abstracts submitted for the 2020 meeting; future iterations could be expanded to include data from previous years.

The dataset is stratified by individual rater (one column per rater), which will make the ‘hawks’ and ‘doves’ analysis possible. This will be done in the future.

The project was launched in an effort to evaluate, and possibly increase, the overall inter-rater reliability of the APSA abstract review process.


Data Analysis

- Descriptive Statistics of the Entire Dataset

Total ratings submitted: 5978

Score  Frequency
1         87
2        150
3        426
4        845
5       1131
6       1348
7       1252
8        546
9        150
10        43

Mean score: 5.637337
Median score: 6
Standard deviation: 1.716367
Minimum score: 1
Maximum score: 10
Skewness of distribution: -0.2647954
Kurtosis of distribution: -0.1189327
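
For reference, here is a minimal R sketch that would produce these summaries, assuming the ratings are in a data frame named `ratings` with one row per abstract and one column per rater (the object name and layout are assumptions on my part, not the actual variable names used):

```r
library(psych)

# Assumption: `ratings` has one row per abstract, one column per rater,
# and NA for cells that were not scored.
all_ratings <- na.omit(unlist(ratings))

length(all_ratings)    # total ratings submitted
table(all_ratings)     # score frequency table
describe(all_ratings)  # mean, median, sd, min, max, skew, kurtosis
```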

Plot the Dataset (Histogram with Normal Distribution Overlay)

I removed the missing data (NA) for the plot…
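
A minimal base-R sketch of such a plot, reusing the pooled `all_ratings` vector assumed above (NAs already removed):

```r
# Histogram of all scores, scaled to density, with a normal curve overlaid.
hist(all_ratings, breaks = seq(0.5, 10.5, by = 1), freq = FALSE,
     xlab = "Score", main = "Abstract ratings with normal overlay")
curve(dnorm(x, mean = mean(all_ratings), sd = sd(all_ratings)),
      add = TRUE, col = "red", lwd = 2)
```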

- Descriptive Statistics of the Program Committee Dataset

Program Committee ratings submitted: 5103

Score  Frequency
1         77
2        137
3        393
4        766
5        989
6       1140
7       1070
8        430
9         85
10        16

Mean score: 5.532824
Median score: 6
Standard deviation: 1.680255
Minimum score: 1
Maximum score: 10
Skewness of distribution: -0.3132788
Kurtosis of distribution: -0.1990434

Plot the Dataset (Histogram with Normal Distribution Overlay)

I removed the missing data (NA) for the plot…

Inter-Rater Reliability (using the Intraclass Correlation, ICC)

Our dataset contains multiple abstracts and multiple raters. It is unbalanced and not fully crossed (the number of ratings per submission varies), and it has missing data (not every rater scored every abstract).

The ICC therefore provides the best way to assess inter-rater reliability for this dataset. Because a different set of raters is “randomly” selected from a larger population of raters for each submission, a one-way model is needed to calculate the ICC [1,2].
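
For reference, the one-way (Case 1) estimators from Shrout and Fleiss [2] are

$$\mathrm{ICC}(1) = \frac{MS_B - MS_W}{MS_B + (k-1)\,MS_W}, \qquad \mathrm{ICC}(k) = \frac{MS_B - MS_W}{MS_B},$$

where $MS_B$ is the between-abstracts mean square, $MS_W$ the within-abstracts mean square, and $k$ the number of ratings per abstract (with our unbalanced data an average value of $k$ has to be used).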

I ran this analysis with the iccNA() function from the irrNA R package [3], which calculates the ICC across all abstracts and all raters without removing the missing data.
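
The call itself is straightforward; a minimal sketch, assuming the ratings are in a matrix or data frame `ratings` with one row per abstract and one column per rater (the object name is an assumption):

```r
library(irrNA)

# Assumption: rows = abstracts, columns = raters, NA for unscored cells.
icc_result <- iccNA(ratings)
icc_result   # list with $ICCs (and iteration details) as shown below
```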

Here is the output of the ICC analysis:

## $ICCs
##                ICC p-value lower CI limit upper CI limit
## ICC(1)   0.1749588       0      0.1497055      0.2039065
## ICC(k)   0.7520025       0      0.7157122      0.7855233
## ICC(A,1) 0.2106269       0      0.1831294      0.2418616
## ICC(A,k) 0.7923343       0      0.7620897      0.8202889
## ICC(C,1) 0.2222910       0      0.1938148      0.2545296
## ICC(C,k) 0.8034248       0      0.7746565      0.8299969
## 
## $n_iter
## [1] 5
## 
## $amk
## [1] 14.30144
## 
## $k_0
## [1] 14.29921

As you can see, the single-rater reliability (ICC(1)) across the 418 abstracts is 17.5%, which is quite poor; high-reliability exams (e.g., the certifying exam of the American Board of Surgery) require inter-rater reliability above 80%. However, the average-rater reliability (ICC(k)) is much better at 75.2%, and this may be the more relevant number in our case (see below).

I then used the ICC() function from the psych package [4], which can also handle missing data, using a slightly different algorithm.
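
A minimal sketch of that call, using the same assumed `ratings` matrix (the exact arguments are my assumption; with `lmer = TRUE`, psych::ICC estimates the variance components with lme4, which copes with the missing ratings):

```r
library(psych)

# Assumption: rows = abstracts, columns = raters, NA for unscored cells.
ICC(ratings, missing = FALSE, lmer = TRUE)
```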

Here are the results…

## Warning in pf(FJ, dfJ, dfE, log.p = TRUE): pbeta(*, log.p=TRUE) ->
## bpser(a=13969.5, b=33.5, x=0.725943,...) underflow to -Inf
##                          type       ICC        F df1   df2 p lower bound
## Single_raters_absolute   ICC1 0.1664209 14.57594 417 28006 0   0.1504504
## Single_random_raters     ICC2 0.1691943 20.03060 417 27939 0   0.1501878
## Single_fixed_raters      ICC3 0.2186656 20.03060 417 27939 0   0.1992761
## Average_raters_absolute ICC1k 0.9313938 14.57594 417 28006 0   0.9233271
## Average_random_raters   ICC2k 0.9326520 20.03060 417 27939 0   0.9231814
## Average_fixed_raters    ICC3k 0.9500764 20.03060 417 27939 0   0.9442062
##                         upper bound
## Single_raters_absolute    0.1847605
## Single_random_raters      0.1906060
## Single_fixed_raters       0.2406691
## Average_raters_absolute   0.9390655
## Average_random_raters     0.9412230
## Average_fixed_raters      0.9556591

The single-rater reliability is close to the previous estimate, at around 16.9% (ICC2). The average-rating reliability looks considerably better, at 93.3% (ICC2k).

Our decisions whether to accept or reject a submission are based mostly on the average score, so the average-measures ICC is likely the appropriate one in our case. Here is a direct quote that supports this:

In studies where all subjects are coded by multiple raters and the average of their ratings is used for hypothesis testing, average-measures ICCs are appropriate. However, in studies where a subset of subjects is coded by multiple raters and the reliability of their ratings is meant to generalize to the subjects rated by one coder, a single-measures ICC must be used. Just as the average of multiple measurements tends to be more reliable than a single measurement, average-measures ICCs tend to be higher than single-measures ICCs. In cases where single-measures ICCs are low but average-measures ICCs are high, the researcher may report both ICCs to demonstrate this discrepancy. [1]

There are plenty of other analyses that could be done with the collected data; I am more than happy to run additional queries if you have other ideas about what to look at.

A few ideas:

- ICC per committee membership
- ICC per topic
- ICC for abstracts accepted for podium or poster presentations versus those rejected
- Dove and hawk calculations between raters (a starting sketch follows below)
- Cross-referencing members' interest in abstracts with the abstract ratings (using clicks during the virtual meeting as a proxy)
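
For the dove/hawk question, a possible starting point (only a sketch, again assuming the abstracts-by-raters matrix `ratings`) would be to compare each rater's rating count and mean score:

```r
# Per-rater summary: how many abstracts each rater scored and how leniently.
rater_summary <- data.frame(
  rater      = colnames(ratings),
  n_rated    = colSums(!is.na(ratings)),
  mean_score = colMeans(ratings, na.rm = TRUE)
)

# Raters scoring well below the overall mean are candidate "hawks",
# raters scoring well above it candidate "doves".
rater_summary[order(rater_summary$mean_score), ]
```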

Pretty interesting stuff!

Andreas

References

  1. Hallgren KA, Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial. Tutor Quant Methods Psychol. 2012; 8(1): 23–34, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032, accessed 5/11/2020

  2. Shrout PE, Fleiss JL, Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979 Mar;86(2):420-8, http://rokwa.x-y.net/Shrout-Fleiss-ICC.pdf, accessed 5/12/2020

  3. Brueckl M, Heuer F, Package ‘irrNA’, https://cran.r-project.org/web/packages/irrNA/irrNA.pdf, accessed 5/11/2020

  4. Revelle W, psych: Procedures for Psychological, Psychometric, and Personality Research, https://cran.r-project.org/web/packages/psych, accessed 5/12/2020