A UK retrospective study evaluated Lunit INSIGHT MMG (Lunit), an AI tool for interpreting screening mammography, using standardized test sets from the NHS Breast Screening Programme’s external quality assurance scheme (PERFORMS). Two software versions (V1.1.7.1 and V1.1.8.1) were benchmarked against 1,254 expert readers on 600 cases comprising 1,200 breasts: 823 normal breasts, 55 benign breasts, and malignant breasts containing a total of 328 malignant lesions.
Used as a standalone reader, the AI assigned a malignancy suspicion score to each breast. Diagnostic performance was high for both versions (AUC: 0.93 for V1 vs. 0.94 for V2; p = 0.13). AI V1's sensitivity was numerically higher than that of human readers but not significantly different (87.5% vs. 83.2%, p = 0.12); only AI V2 achieved a statistically significant improvement (88.7%, p = 0.04). Both AI versions significantly outperformed humans in specificity: 87.4% (V1) and 88.2% (V2) versus 79.0% for human readers (p < 0.01 for both).
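For readers who want to see how such a standalone evaluation is computed, the sketch below derives the three reported metrics from per-breast suspicion scores and ground-truth labels. All data here are synthetic, and the recall threshold of 50 is an illustrative assumption; the study does not publish the vendor's operating point.

```python
# Minimal sketch of a standalone-reader evaluation on synthetic data.
# `y_true` marks each breast as malignant (1) or non-malignant (0);
# `scores` are hypothetical 0-100 malignancy suspicion scores.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1200)                      # one label per breast
scores = np.clip(rng.normal(35 + 30 * y_true, 15), 0, 100)  # synthetic suspicion scores

auc = roc_auc_score(y_true, scores)            # threshold-free discrimination (AUC)

recalled = (scores >= 50).astype(int)          # binarise at an assumed recall threshold
tn, fp, fn, tp = confusion_matrix(y_true, recalled).ravel()
sensitivity = tp / (tp + fn)                   # recalled malignant breasts / all malignant
specificity = tn / (tn + fp)                   # correctly cleared / all non-malignant
print(f"AUC={auc:.2f}  sensitivity={sensitivity:.1%}  specificity={specificity:.1%}")
```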
The study concludes that test set–based benchmarking is an effective, scalable method to detect performance drift between AI software updates. Notably, only the latest version (V2) delivered significant gains in both sensitivity and specificity over expert readers, highlighting the importance of routine post-deployment monitoring to ensure patient safety and preserve diagnostic accuracy.
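As a concrete illustration of this kind of version-to-version drift check, one simple approach is a paired bootstrap of the AUC difference between two releases scored on the same test set. This is a sketch under assumptions; the study does not specify its comparison method, and `y`, `s_v1`, and `s_v2` are hypothetical arrays.

```python
# Paired-bootstrap sketch for flagging AUC drift between two model versions.
# Assumes NumPy arrays: labels `y` (0/1) and per-breast scores `s_v1`, `s_v2`
# from both versions on the same cases. Not the study's published method.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_drift_ci(y, s_v1, s_v2, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    observed = roc_auc_score(y, s_v2) - roc_auc_score(y, s_v1)
    diffs = []
    n = len(y)
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, size=n)        # resample breasts with replacement
        yb = y[idx]
        if yb.min() == yb.max():                # AUC needs both classes present
            continue
        diffs.append(roc_auc_score(yb, s_v2[idx]) - roc_auc_score(yb, s_v1[idx]))
    lo, hi = np.percentile(diffs, [2.5, 97.5])  # 95% bootstrap CI for the difference
    return observed, (lo, hi)                   # flag drift if the CI excludes 0
```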
Keeping AI on Track: Regular monitoring of algorithmic updates in mammography
European Journal of Radiology, 2025
Abstract
Purpose
To demonstrate a method of benchmarking the performance of two consecutive software releases of the same commercial artificial intelligence (AI) product against trained human readers using the Personal Performance in Mammographic Screening (PERFORMS) external quality assurance scheme.
Methods
In this retrospective study, ten PERFORMS test sets, each consisting of 60 challenging cases, were evaluated by human readers between 2012 and 2023 and by Version 1 (V1) and Version 2 (V2) of the same AI model in 2022 and 2023, respectively. Both AI and human readers assessed each breast independently, taking the highest suspicion-of-malignancy score per breast for non-malignant cases and per lesion for breasts with malignancy. Sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) were calculated for comparison, with the study powered to detect a medium-sized effect for sensitivity (odds ratio 3.5 or 0.29).
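To make the per-breast scoring rule concrete, here is a small sketch on a hypothetical long-format table of reads (column names and values are illustrative, not from the study). It applies the highest-score-per-breast rule used for non-malignant breasts; malignant breasts are instead scored per lesion.

```python
# Sketch of the highest-suspicion-per-breast aggregation described in Methods,
# on a made-up table. Column names and values are hypothetical.
import pandas as pd

reads = pd.DataFrame({
    "case_id":      [1, 1, 1, 2, 2],
    "breast":       ["L", "L", "R", "L", "R"],
    "lesion_score": [34, 81, 12, 5, 60],   # one row per marked region/lesion
})

# Non-malignant breasts: keep the highest suspicion score per breast.
# (Malignant breasts are scored per lesion, so no max-pooling there.)
per_breast = (reads.groupby(["case_id", "breast"], as_index=False)["lesion_score"]
                    .max()
                    .rename(columns={"lesion_score": "breast_score"}))
print(per_breast)
```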
Results
The study included 1,254 human readers, with a total of 328 malignant lesions, 823 normal breasts, and 55 benign breasts analysed. No significant difference was found between the AUCs of AI V1 (0.93) and V2 (0.94) (p = 0.13). For sensitivity, no difference was observed between human readers and AI V1 (83.2% vs. 87.5%, p = 0.12); however, V2 outperformed humans (88.7%, p = 0.04). Specificity was higher for both AI V1 (87.4%) and V2 (88.2%) than for human readers (79.0%; p < 0.01 for both).
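As a rough illustration of how a specificity gap of this kind could be tested on aggregate counts, the sketch below runs a two-proportion z-test. The counts are placeholders back-calculated from the reported rates, and the paper's actual analysis, which pools 1,254 readers with many reads each, likely uses a different statistical model.

```python
# Illustrative two-proportion z-test for a specificity gap on placeholder counts.
# Not the paper's published analysis; counts are approximate reconstructions.
from statsmodels.stats.proportion import proportions_ztest

non_malignant = 878                  # 823 normal + 55 benign breasts in the test sets
correct = [round(0.874 * non_malignant),   # AI V1: correctly cleared breasts
           round(0.790 * non_malignant)]   # humans: correctly cleared breasts
total = [non_malignant, non_malignant]
stat, p = proportions_ztest(count=correct, nobs=total)
print(f"z = {stat:.2f}, p = {p:.4f}")
```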
Conclusion
The upgraded AI model showed no significant difference in diagnostic performance compared to its predecessor when evaluating mammograms from PERFORMS test sets.