Human adjudication of DR grading enhances machine learning algorithm
Data derived from live adjudication of fundus image grading improved the accuracy of a deep learning algorithm for diabetic retinopathy screening, according to a study.
In this retrospective analysis, retinal fundus images from diabetic retinopathy screening programs were graded by an algorithm and by U.S. board-certified ophthalmologists and retina specialists. The adjudicated consensus of the retina specialists served as the reference standard.
“In this study, we compared adjudication with other methods of obtaining ground truth and the types of errors that adjudication identifies. We also show that using a small number of adjudicated DR grades allows for substantial improvements in algorithm performance. The resulting algorithm’s performance was similar to or exceeded that of individual U.S. board-certified ophthalmologists and retinal specialists,” Lily Peng, MD, PhD, said.
The researchers measured area under the curve (AUC), sensitivity and specificity to compare the performance of the different forms of manual grading and the algorithm.
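For readers unfamiliar with these metrics, the short Python sketch below shows how AUC, sensitivity and specificity are typically computed for a binary "moderate or worse diabetic retinopathy" decision. The labels, scores and operating threshold are hypothetical and are not drawn from the study.

```python
# Illustrative sketch (not the study's code): computing AUC, sensitivity and
# specificity for a binary referable-DR decision from made-up data.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

# Hypothetical reference-standard labels (1 = moderate or worse DR) and model scores.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.8, 0.9, 0.2, 0.6, 0.3, 0.7])

auc = roc_auc_score(y_true, y_score)      # area under the ROC curve

y_pred = (y_score >= 0.5).astype(int)     # example operating threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)              # true positive rate
specificity = tn / (tn + fp)              # true negative rate

print(f"AUC={auc:.3f}, sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
```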
The three retina specialists each had specificity of approximately 99%, but their sensitivity ranged from 74.4% to 82.1%.
For detection of moderate or worse diabetic retinopathy, the algorithm’s AUC improved from 0.934 to 0.986 when the adjudicated consensus of the retina specialists was used as the reference standard.
A total of 193 discrepancies were noted between the retina specialists’ adjudication and the ophthalmologists’ majority decision. The most common discrepancies were missed microaneurysms, artifacts and misclassified hemorrhages.
“We feel our study is particularly timely in light of recent developments in medical imaging wherein deep learning is being used to train algorithms that can accurately detect disease from an image on par with trained physicians. Unlike other machine learning techniques, which rely heavily on ‘feature engineering,’ where computers are programmed to follow a set of explicit rules, deep learning involves programming computers to learn from a large number of labeled examples, without explicitly defining which features are important. Thus, selection of the right reference standard is critical in building clinically relevant deep learning algorithms. In addition, because obtaining the best reference standard can be resource-intensive, we demonstrate how only a subset of the images (eg, the ‘tune’ set) has to be labeled in a resource-intensive manner to yield superior results,” Peng said. – by Robert Linnehan
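Peng’s point about labeling only a small, carefully adjudicated “tune” subset can be illustrated with a hedged sketch: an already-trained model’s operating threshold is chosen on a small adjudicated set so that a target sensitivity is met. The function, data and sensitivity target below are hypothetical and are not from the study.

```python
# Hedged sketch of the idea in the quote: use a small adjudicated "tune" subset
# to choose an operating point for an already-trained model.
# All data and the target sensitivity here are hypothetical, not from the study.
import numpy as np

def pick_threshold(scores, adjudicated_labels, target_sensitivity=0.90):
    """Return the highest threshold whose sensitivity on the adjudicated
    tune set still meets the target (assumes higher score = more likely DR)."""
    best = 0.0
    for t in np.linspace(0, 1, 101):
        pred = scores >= t
        tp = np.sum(pred & (adjudicated_labels == 1))
        fn = np.sum(~pred & (adjudicated_labels == 1))
        sens = tp / max(tp + fn, 1)
        if sens >= target_sensitivity:
            best = t  # keep raising the threshold while sensitivity holds
    return best

# Hypothetical adjudicated tune set: model scores and consensus grades.
tune_scores = np.array([0.15, 0.35, 0.55, 0.72, 0.81, 0.9, 0.4, 0.65])
tune_labels = np.array([0, 0, 1, 1, 1, 1, 0, 1])
print(pick_threshold(tune_scores, tune_labels))
```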
Disclosures: Krause reports he is an employee of and has stock ownership in Google. Please see the study for all other authors’ relevant financial disclosures.
Editor's note: This article has been updated to include comments from Lily Peng, MD, PhD.