PubMed ID: 38827491
Author(s): Domalpally A, Slater R, Linderman RE, Balaji R, Bogost J, Voland R, Pak J, Blodi BA, Channa R, Fong D, Chew EY. Strong versus Weak Data Labeling for Artificial Intelligence Algorithms in the Measurement of Geographic Atrophy. Ophthalmol Sci. 2024 Jan 26;4(5):100477. doi: 10.1016/j.xops.2024.100477. eCollection 2024 Sep-Oct.
Journal: Ophthalmology Science, Volume 4, Issue 5, 2024
PURPOSE To understand the data labeling requirements for training deep learning models to measure geographic atrophy (GA) on fundus autofluorescence (FAF) images.
DESIGN Evaluation of artificial intelligence (AI) algorithms.
SUBJECTS The Age-Related Eye Disease Study 2 (AREDS2) images were used for training and cross-validation, and GA clinical trial images were used for testing.
METHODS Training data consisted of 2 sets of FAF images: one with area measurements only and no indication of GA location (Weakly labeled), and the other with GA segmentation masks (Strongly labeled).
MAIN OUTCOME MEASURES Bland-Altman plots and scatter plots were used to compare GA area measurement between ground truth and AI measurements. The Dice coefficient was used to compare accuracy of segmentation of the Strong model.
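The Dice coefficient named above is a standard overlap metric for comparing two binary segmentation masks, defined as 2|A ∩ B| / (|A| + |B|). A minimal sketch of its computation (not the authors' implementation; the function name and NumPy-based masks are assumptions for illustration):

```python
import numpy as np

def dice_coefficient(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks:
    2 * |A intersect B| / (|A| + |B|), ranging from 0 (no overlap) to 1."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom
```

A Dice value near 1 (e.g., the 0.99 intergrader agreement reported below) indicates near-identical delineation of the GA region.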
RESULTS In the cross-validation AREDS2 data set (n = 601), the mean (standard deviation [SD]) area of GA measured by human grader, Weakly labeled AI model, and Strongly labeled AI model was 6.65 (6.3) mm2, 6.83 (6.29) mm2, and 6.58 (6.24) mm2, respectively. The mean difference between ground truth and AI was 0.18 mm2 (95% confidence interval [CI], -7.57 to 7.92) for the Weakly labeled model and -0.07 mm2 (95% CI, -1.61 to 1.47) for the Strongly labeled model. With GlaxoSmithKline testing data (n = 156), the mean (SD) GA area was 9.79 (5.6) mm2, 8.82 (4.61) mm2, and 9.55 (5.66) mm2 for human grader, Strongly labeled AI model, and Weakly labeled AI model, respectively. The mean difference between ground truth and AI for the 2 models was -0.97 mm2 (95% CI, -4.36 to 2.41) and -0.24 mm2 (95% CI, -4.98 to 4.49), respectively. The Dice coefficient was 0.99 for intergrader agreement, 0.89 for the cross-validation data, and 0.92 for the testing data.
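The Bland-Altman summaries above report a mean difference (bias) and an interval around it. A hedged sketch of the standard Bland-Altman computation, where the interval is the 95% limits of agreement (bias ± 1.96 × SD of the paired differences); the function name and inputs are illustrative assumptions, not the authors' code:

```python
import numpy as np

def bland_altman(ground_truth, predicted):
    """Bias and 95% limits of agreement between paired measurements.

    Returns (bias, lower_limit, upper_limit), where the limits are
    bias +/- 1.96 * sample SD of the differences (predicted - ground truth).
    """
    gt = np.asarray(ground_truth, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    diffs = pred - gt
    bias = diffs.mean()
    half_width = 1.96 * diffs.std(ddof=1)  # sample SD (n - 1 denominator)
    return bias, bias - half_width, bias + half_width
```

Under this convention, a narrower interval (as seen for the Strongly labeled model on the cross-validation data) indicates tighter per-image agreement with the grader, independent of the average bias.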
CONCLUSIONS Deep learning models can achieve reasonable accuracy even with Weakly labeled data. Training methods that integrate large volumes of Weakly labeled images with a small number of Strongly labeled images offer a promising solution to overcome the burden of cost and time for data labeling.
FINANCIAL DISCLOSURES Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
© 2024 by the American Academy of Ophthalmology.