PubMed ID: 40969489
Author(s): Chen Q, Keenan TDL, Agron E, Allot A, Guan E, Duong B, Elsawy A, Hou B, Xue C, Bhandari S, Broadhead G, Cousineau-Krieger C, Davis E, Gensheimer WG, Golshani CA, Grasic D, Gupta S, Haddock L, Konstantinou E, Lamba T, Maiberger M, Mantopoulos D, Mehta MC, Elnahry AG, Al-Nawaflh M, Oshinsky A, Powell BE, Purt B, Shin S, Stiefel H, Thavikulwat AT, Wroblewski KJ, Chung TY, Cheung CMG, Cheng CY, Chew EY, Hribar MR, Chiang MF, Lu Z. Towards Accountable AI in Eye Disease Diagnosis: Workflow, External Validation, and Development. ArXiv [Preprint]. 2025 Jul 24:arXiv:2409.15087v2.
Journal: ArXiv, Jul 2025
IMPORTANCE Timely disease diagnosis is challenging due to limited clinician availability and growing clinical burdens. Although artificial intelligence (AI) shows expert-level diagnostic accuracy, a lack of downstream accountability, including workflow integration, external validation, and further development, continues to hinder its real-world adoption.
OBJECTIVE To address gaps in the downstream accountability of medical AI through a case study on age-related macular degeneration (AMD) diagnosis and severity classification.
DESIGN SETTING AND PARTICIPANTS We developed and evaluated an AI-assisted diagnostic and classification workflow for AMD. Four rounds of diagnostic assessments (accuracy and time) were conducted with 24 clinicians from 12 institutions. Each round was randomized and alternated between Manual and Manual + AI, with a washout period. In total, 2,880 AMD risk features were evaluated across 960 images from 240 Age-Related Eye Disease Study patient samples, both with and without AI assistance. For further development, we enhanced the original DeepSeeNet model into DeepSeeNet+ using ~40,000 additional images from the US population and tested it on three datasets, including an external set from Singapore.
MAIN OUTCOMES AND MEASURES We compared Manual vs. Manual + AI on diagnostic accuracy, measured by F1-score (Wilcoxon rank-sum test), and on diagnostic time (linear mixed-effects model). For further development, F1-score (Wilcoxon rank-sum test) was also used to compare model performance.
RESULTS Among the 240 patients (mean age, 68.5 years; 53% female), AI assistance improved accuracy for 23 of 24 clinicians, increasing the average F1-score by 20% (from 37.71 to 45.52), with some individual improvements exceeding 50%. Manual diagnosis initially took an estimated 39.8 seconds per patient; Manual + AI saved 10.3 seconds in the first round and remained 1.7-3.3 seconds faster in later rounds. However, combining manual and AI assessment did not always yield the highest accuracy or efficiency, underscoring ongoing challenges in explainability and trust. DeepSeeNet+ outperformed the original model on all three test sets, achieving a 13% higher F1-score in the external Singapore cohort.
CONCLUSIONS AND RELEVANCE In this diagnostic study, AI assistance improved both accuracy and time efficiency for AMD diagnosis, and further model development was essential for improving generalizability across diverse populations. These findings highlight the need for downstream accountability during early-stage clinical evaluations of medical AI. All code and models are publicly available.