Completed as part of MIT 6.S977 Ethical Machine Learning In Human Deployments
Introduction
Problem
This paper addresses the challenges of leveraging medical data in deep learning systems for diagnosis and treatment in resource-limited settings, particularly in low- and middle-income countries (LMICs). We focus on retinal images and propose using vector embedding techniques to convert them into a compressed, information-rich format that retains essential diagnostic features. The goal is to balance diagnostic accuracy, computational efficiency, and patient privacy within the specific constraints and needs of healthcare providers in LMICs.
Motivation
Diabetic retinopathy is a leading cause of preventable blindness, disproportionately affecting individuals in LMICs [1]. While deep learning systems have shown great potential in enhancing diabetic retinopathy screening, their implementation in resource-constrained settings remains understudied [1]. Successfully addressing these challenges contributes to both machine learning and global health, in service of equitable access to diabetic retinopathy screening worldwide.
Background
Dimensionality Reduction Techniques in Medical Imaging Data
The broad consensus in the area of dimensionality reduction for medical image analysis is that these techniques can effectively reduce computational resources while maintaining essential data characteristics. Dimensionality reduction methods have been shown to save storage space, reduce computation time, and improve performance on smaller datasets [2].
Studies have demonstrated the effectiveness of various dimensionality reduction techniques in the context of medical data. For example, an analysis of diagnostic breast cancer data using diffusion maps (DM) successfully reduced the dimensionality from 30 to as low as 1, with only a minor drop in accuracy from 0.933 to 0.908 [2].
The consensus is that non-linear dimensionality reduction (NLDR) techniques, such as diffusion maps, can effectively transform high-dimensional medical data to low dimensions while preserving key characteristics [2].
Another study highlights the potential of using small embeddings (less than 1 MB) for sharing medical data with reduced privacy concerns [3].
While these studies demonstrate the effectiveness of dimensionality reduction techniques in various medical contexts, they lack direct comparisons on a single dataset, which could provide more conclusive evidence of their relative performance and generalizability.
Considerations Specific to Medical Practice in Low and Middle Income Countries
Previous research by Williams et al. [4] has highlighted the "strategic importance of AI in LMICs for addressing critical medical skills and staff shortages, providing access to specialized skills, and empowering nurses and community healthcare workers to deliver services that previously required scarce medical officers," in contrast to the focus of AI in high-income economies, which is primarily on improving the quality of care and potentially lowering costs [4]. Additionally, research in Thailand has shown that socio-environmental factors, including social, cultural, and environmental contexts, can significantly influence the performance and impact of AI models in LMICs [5]. Therefore, when developing and deploying AI solutions in LMICs, it is crucial to consider not only the technical aspects of the models but also the specific challenges these contexts present on the ground.
Despite the growing recognition of the importance of AI in LMICs, there is a gap in the literature regarding the development and evaluation of AI solutions that are tailored to the unique needs and constraints of these settings.
Open Problems and Additional Gaps in Literature
While there has been work on dimensionality reduction, studies have primarily focused on a single solution rather than comparing and evaluating multiple approaches. In the current literature, problems like distribution shifts and differences in training data are presented without in-depth analysis or proposed solutions. There is a significant gap regarding the measurement and quantification of these tradeoffs in a manner that is both usable and deployable in real-world LMIC settings, particularly for the application of dimensionality reduction techniques to retinal images and the measurable differences in computational resources once deployed.
Research
Data
The Brazilian Multilabel Ophthalmological Dataset (BRSET) consists of 16,266 retinal fundus images from 8,524 Brazilian patients, collected between 2010 and 2020 [6]. The dataset includes one macula-centered paired exam per patient. Along with the retinal images, BRSET provides demographic information such as age, sex, clinical history, insulin use, and duration of diabetes. The researchers labeled the images with anatomical parameters of the optic disc, retinal vessels, and macula, as well as quality control parameters such as focus, illumination, image field, and artifacts [6]. Additionally, the dataset includes multi-label disease classifications for various retinal conditions, such as diabetic retinopathy, macular edema, age-related macular degeneration, and hypertensive retinopathy.
Sourcing and Labeling
The retinal images in BRSET were sourced from three Brazilian ophthalmological centers in São Paulo [6][7]. Trained non-medical professionals captured the images under pharmacological mydriasis using Nikon NF505 and Canon CR-2 retinal cameras. The labeling was carried out by a retinal specialist ophthalmologist following criteria established by the research group [6]. Demographic and medical features were collected from electronic medical records based on patients' self-reported medical history. While BRSET provides a valuable resource for developing and validating machine learning models in ophthalmology, it is important to note that the dataset represents a single nationality (Brazilian) and a general ophthalmological clinic patient population [6]. As a result, the data has an unbalanced disease distribution, with only 7% of images positive for diabetic retinopathy, and it cannot be used to explore fairness and privacy related to race.
Figure 1: Demographic overview of patients in the BRSET dataset
Figure 2: Sample images from the dataset. Class 0 shows images of eyes negative for diabetic retinopathy and Class 1 shows images positive for diabetic retinopathy
Approach
Our approach leverages machine learning techniques, focusing on deep learning and transfer learning. We extract embeddings from the retinal fundus images using various models of different sizes without fine-tuning, such as CLIP [8], ConvNext [9], DinoV2 [10], ViT [11], and RETFound, which is of particular interest because it is a foundation model built specifically for retinal images [12]. These embeddings serve as feature representations of the images, capturing relevant information for the classification task.
We use a Support Vector Machine (SVM) classifier to assess how well these embeddings represent the retinal fundus images. The SVM classifier is appropriate for our task, as it can effectively handle the high-dimensional embeddings derived from our relatively small dataset [13]. We train the SVM classifier using the embeddings as input features and compare the performance of the different embedding methods in predicting diabetic retinopathy. This comparison identifies the best-performing embedding, which in our experiments is DinoV2 Small with Registers [14].
After selecting the best-performing embedding, we will fine-tune the DinoV2 Small with Registers model to adapt it to predict diabetic retinopathy with BRSET images. Fine-tuning allows us to leverage the model's pre-trained knowledge while tailoring it to our specific classification problem.
We then explore reducing the dimensionality of the embeddings using UMAP (Uniform Manifold Approximation and Projection) [15], obtaining lower-dimensional representations at several target sizes. These reduced embeddings are then compared as input features for the SVM classifier, which is trained to predict the presence or absence of diabetic retinopathy.
We will evaluate the performance of our models using various metrics, including accuracy, F1 score, and Equal Opportunity Difference. These metrics will help us assess the classification performance and fairness aspects of our approach.
Lastly, we will investigate the privacy implications of our approach by analyzing the ability to predict sensitive attributes, such as sex, from the embeddings and fine-tuned models. This analysis will help us understand the extent to which the embeddings and models capture and retain sensitive information about patients.
Methods
Data Preparation
The BRSET dataset comprises 16,266 retinal fundus images from 8,524 Brazilian patients, associated demographic information, and multi-label disease classifications. Since patients are likely to have diabetic retinopathy in both eyes [16], the dataset was split into training (85%) and testing (15%) sets based on patient ID to avoid data leakage. Additionally, during training, all features except the image embedding features are removed, as the patient data available at sites in our target LMICs is unknown and may vary between countries.
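To make the split concrete, the following is a minimal sketch of a patient-level split using scikit-learn's GroupShuffleSplit; the file path and column names are hypothetical, not the actual BRSET schema.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("brset_labels.csv")  # hypothetical path to image-level labels

# Group by patient ID so both eyes of a patient land in the same split,
# preventing leakage between train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))

train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
assert set(train_df["patient_id"]).isdisjoint(test_df["patient_id"])
```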
Embedding Comparison
Various pre-trained embedding models of different sizes were used to extract features from the retinal fundus images without fine-tuning. The embeddings compared included "clip_base" (embedding size: 768), "convnext_base" (embedding size: 1024), "convnext_tiny" (embedding size: 768), "convnextv2_base" (embedding size: 1024), "convnextv2_tiny" (embedding size: 768), "dinov2_base" (embedding size: 768), "dinov2_large" (embedding size: 1024), "dinov2_small" (embedding size: 384), "dinov2_small_registers" (embedding size: 384), "retfound" (embedding size: 1024), "swin_base" (embedding size: 1024), "swin_tiny" (embedding size: 768), and "vit_base" (embedding size: 768).
Figure 3: Embedding methods used in our experiment listed by size of embedding
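As an illustration of this step, below is a minimal sketch of extracting a frozen embedding for a single image; the torch.hub entry point and preprocessing shown are assumptions and may differ from the exact pipeline used.

```python
import torch
from torchvision import transforms
from PIL import Image

# Load DINOv2 Small with Registers with frozen pre-trained weights.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14_reg")
model.eval()

# Standard ImageNet-style preprocessing (an assumption, not the study's exact recipe).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("fundus.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    embedding = model(img)  # shape (1, 384) for the small model
```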
The performance of these embeddings was evaluated using an SVM classifier, with hyperparameters optimized through grid search, focusing on the F1 score. The SVM hyperparameters used were C (Regularization) = 1, polynomial kernel, and Gamma = 'auto', with class weights adjusted to handle class imbalance. These embeddings were compared based on the resulting F1 score, number of false negatives, and average time to generate an embedding.
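The snippet below sketches this evaluation under the stated hyperparameters; the grid values beyond the selected configuration are illustrative, and X_train_embeddings / y_train are assumed to come from the extraction step above.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["poly", "rbf"],
    "gamma": ["auto", "scale"],
}

# class_weight="balanced" adjusts weights to handle the class imbalance.
svm = SVC(class_weight="balanced")
search = GridSearchCV(svm, param_grid, scoring="f1", cv=5)
search.fit(X_train_embeddings, y_train)  # embeddings as input features

print(search.best_params_)  # selected here: C=1, kernel='poly', gamma='auto'
y_pred = search.best_estimator_.predict(X_test_embeddings)
```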
Fine-tuning
We selected DinoV2 Small with Registers for fine-tuning to adapt it specifically to predicting diabetic retinopathy with the BRSET dataset. The fine-tuning process consisted of training the model's head (top 3 layers) and freezing the remaining layers. During fine-tuning, we applied the following image preprocessing steps for data augmentation: Random zoom-in of up to 20%, random horizontal and vertical image flips, and changes in saturation and lightness. We used four registers for fine-tuning.
We fine-tuned the model on the preprocessed and augmented retinal fundus images from the training set for 50 epochs. The loss function was Binary Cross-Entropy with Logits Loss (BCEWithLogitsLoss), with a positive class weight based on the class distribution. The optimizer was Adam (Adaptive Moment Estimation) with a learning rate of 1e-6.
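The following is a hedged sketch of this fine-tuning setup, not the exact training code: the layer selection, augmentation parameters, and class-count variables are approximations of the description above.

```python
import torch
import torch.nn as nn
from torchvision import transforms

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14_reg")

# Freeze everything, then unfreeze the top of the network ("head").
for p in model.parameters():
    p.requires_grad = False
for block in model.blocks[-3:]:  # approximation of "top 3 layers"
    for p in block.parameters():
        p.requires_grad = True

classifier = nn.Linear(384, 1)  # binary DR head on the 384-d embedding

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # ~20% zoom-in
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ColorJitter(saturation=0.2, brightness=0.2),  # saturation/lightness proxy
    transforms.ToTensor(),
])

# pos_weight from the class distribution; neg_count/pos_count are placeholders.
pos_weight = torch.tensor([neg_count / pos_count])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad]
    + list(classifier.parameters()),
    lr=1e-6,
)
```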
The fine-tuned DinoV2 Small with Registers model was evaluated on the test set to assess its performance in predicting diabetic retinopathy. The fine-tuned model's performance was compared to the original, non-fine-tuned model to quantify the improvement achieved through the fine-tuning process.
Dimensionality Reduction
We employed UMAP [15] to reduce the dimensionality of the embeddings to various sizes: 2, 5, 10, 20, 50, 100, 200, and 300. Each reduced-dimensional embedding was then used as input features for the SVM classifier, maintaining the same architecture and hyperparameters as in the embedding comparison step.
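A minimal sketch of this reduction step with the umap-learn package is shown below; variable names carry over from the earlier snippets.

```python
import umap

for n_components in [2, 5, 10, 20, 50, 100, 200, 300]:
    reducer = umap.UMAP(n_components=n_components, random_state=42)
    # Fit on the training embeddings, then project the test embeddings.
    X_train_reduced = reducer.fit_transform(X_train_embeddings)
    X_test_reduced = reducer.transform(X_test_embeddings)
    # The reduced embeddings are fed to the same SVM architecture and
    # hyperparameters to isolate the effect of dimensionality.
```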
Evaluation Metrics
We evaluate the performance of the models using several metrics, including accuracy, F1 score, and number of false negatives. These metrics are chosen because of our class imbalance and because false negatives have a higher chance of causing harm, such as delayed treatment and worse health outcomes for those affected.
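For illustration, these metrics can be computed with scikit-learn as follows (y_pred denoting the SVM's test-set predictions):

```python
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
# For binary labels, ravel() unpacks the 2x2 confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"accuracy={acc:.3f}, F1={f1:.3f}, false negatives={fn}")
```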
We use equal opportunity difference to measure fairness across embeddings because it focuses specifically on ensuring that the true positive rates are equal across different protected groups. When evaluating a system that predicts a medical condition, equal opportunity concentrates on equalizing the true positive rates across protected groups. By ensuring equal true positive rates, equal opportunity helps prevent scenarios where the model systematically under-diagnoses the condition in certain groups.
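Concretely, the Equal Opportunity Difference is the gap in true positive rates between protected groups. A small illustrative implementation, assuming NumPy arrays for labels, predictions, and group membership:

```python
import numpy as np

def true_positive_rate(y_true, y_pred):
    """Fraction of actual positives correctly flagged (NaN if no positives)."""
    positives = y_true == 1
    return (y_pred[positives] == 1).mean() if positives.any() else np.nan

def equal_opportunity_difference(y_true, y_pred, group):
    """Max gap in TPR across protected groups (e.g., patient sex)."""
    tprs = [true_positive_rate(y_true[group == g], y_pred[group == g])
            for g in np.unique(group)]
    return max(tprs) - min(tprs)
```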
Privacy Analysis
We investigate the approach's privacy implications by analyzing the ability to predict sensitive attributes from the embeddings and fine-tuned models. We used patient sex as the target label, as other sensitive attributes such as race were not available in the dataset. The ROC AUC metric was used to assess privacy leakage, with the ideal scenario being random performance (AUC close to 0.5). We conduct the privacy analysis on the original DinoV2 Small with Registers model, the fine-tuned model, and the UMAP embeddings derived from these models.
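A sketch of this privacy probe, assuming the same embedding variables as before and sex labels drawn from the dataset:

```python
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# Train a probe to predict sex from the embeddings; AUC near 0.5 means
# the embeddings leak little information about the sensitive attribute.
probe = SVC(probability=True, class_weight="balanced")
probe.fit(X_train_embeddings, sex_train)
scores = probe.predict_proba(X_test_embeddings)[:, 1]
print("sex-prediction AUC:", roc_auc_score(sex_test, scores))
```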
Results
Experiment Outcomes
Performance and Efficiency Comparison of Embedding Models
Our analysis of “out of the box” embedding methods reveals several key insights. As shown in Figure 4, the DinoV2 Large model achieved the highest F1 score, followed closely by DinoV2 Small. Despite its excellent performance, the DinoV2 Large model is not the most practical due to its longer embedding generation time. In contrast, DinoV2 Small with Registers emerges as the most efficient model: it matches the small embedding size of DinoV2 Small while generating embeddings faster (Figure 5). Additionally, DinoV2 Small with Registers produces fewer false negatives than DinoV2 Small (Figure 6).
Therefore, we chose the DinoV2 Small with Registers model for fine-tuning and further experiments.
Figure 4: SVM model results across embeddings. Different sizes of the Dinov2 embeddings all perform best.
Figure 5: Average time for each embedding to be generated from a single image. Smaller embeddings take less time to generate. All embeddings were generated using CPUs. Based on discussions with the BRSET researchers, the time required to generate embeddings is the second most critical metric after accuracy.
Figure 6: Number of false negatives predicted by the SVM model per embedding
Performance Improvements with Fine-Tuning
After fine-tuning, our model demonstrated improvements in performance metrics. Specifically, accuracy increased from 0.931 to 0.973, and the F1 score rose from 0.939 to 0.972.
Stability of Fine-Tuned Embeddings with Reduced Dimensionality
Crucially, our findings highlight that fine-tuned embeddings maintain their performance even as the embedding size decreases. This is particularly significant for applications in resource-limited settings, where smaller embeddings can substantially reduce the computational and storage demands.
To illustrate this, we compared the F1 scores of fine-tuned and non-fine-tuned embeddings across various embedding sizes using UMAP for dimensionality reduction (Figure 7). Our analysis revealed that while the performance of non-fine-tuned embeddings deteriorates with smaller sizes, fine-tuned embeddings exhibit remarkable stability.
Figure 7: Embedding Length vs. F1 Score for Fine-Tuned and Non-Fine-Tuned Models. The orange bars represent the fine-tuned embeddings, while the blue bars represent the non-fine-tuned embeddings. As illustrated, fine-tuned embeddings maintain high F1 scores even as embedding size decreases, whereas non-fine-tuned embeddings show a decline in performance with reduced embedding sizes.
Impact of Fine-tuning and Embedding Dimensionality Reduction on Model Performance and Fairness
In addition to evaluating overall performance metrics, we analyzed the distribution of false negatives across different embedding lengths for both fine-tuned and non-fine-tuned models.
Non-Fine-Tuned Model: As shown in Figure 8, the number of false negatives across various embedding lengths appears to fluctuate without a clear pattern. This randomness indicates that the embedding size in the non-fine-tuned model does not have a predictable effect on the incidence of false negatives.
Fine-Tuned Model: In contrast, we see a notable decrease in the number of false negatives as the embedding length is reduced. Initially, the fine-tuned model starts with 45 false negatives, but this number decreases to a range of 29-32 false negatives for smaller embedding sizes. This trend suggests that fine-tuning the model not only improves overall performance but could contribute to a more consistent reduction in false negatives after dimensionality reduction.
Figure 8: Embedding Length vs. Number of False Negatives for Fine-Tuned and Non-Fine-Tuned Models. The orange bars represent the fine-tuned embeddings, while the blue bars represent the non-fine-tuned embeddings. As illustrated, the fine-tuned model's false negatives decrease as embedding size decreases, whereas the non-fine-tuned model's false negatives fluctuate without a clear pattern.
The Equal Opportunity Difference for every embedding size, including the original 384 dimensions, remained below a 15% threshold for both fine-tuned and non-fine-tuned models. This consistent performance across all embedding sizes indicates that reducing embedding dimensionality does not significantly impact the fairness of the models in regard to patient sex when using Equal Opportunity Difference as a measure.
Privacy Analysis Results
Figure 9: SVM model predictions move closer to randomness as dimensionality decreases
Based on discussions with the BRSET researchers, privacy is a crucial consideration in model deployment. To evaluate the potential privacy implications, we analyzed the ROC scores, which measure the model's ability to distinguish between different classes. An ROC score of 0.5 indicates random performance, while higher scores indicate better discriminative power.
Our analysis suggests that, whether or not the model is fine-tuned, dimensionality reduction may help improve patient privacy by reducing the model's ability to distinguish between different classes. In the case of the fine-tuned model, this dimensionality reduction does not significantly hurt the model's performance.
Contributions
This paper makes several contributions to machine learning in the context of resource-constrained environments commonly found in low- and middle-income countries (LMICs).
Firstly, we evaluated the impact of fine-tuning on model performance across various embedding sizes. Our findings indicate that fine-tuned models can outperform non-fine-tuned models in terms of F1 score and false negative rates. This suggests that fine-tuning may play a role in achieving robust performance.
Additionally, we analyzed how embedding dimensionality reduction affects both model performance and fairness. Our results show that embeddings created from a fine-tuned model can maintain high F1 scores and lower false negative rates even as embedding sizes decrease. This is particularly relevant for LMICs, where reducing embedding size can substantially lower computational and storage demands without significantly sacrificing model reliability.
Our results demonstrate that fine-tuning not only improves overall model performance but also ensures that embeddings can be effectively reduced in size without compromising accuracy and F1 scores. By optimizing embedding dimensionality, we can enhance the scalability and accessibility of AI technologies in LMICs, contributing to more equitable technological advancement.
Limitations
Despite the promising results, our study has several limitations that should be acknowledged. Firstly, our evaluation of fine-tuned versus non-fine-tuned models is based on a single relatively small dataset, which may limit the generalizability of our findings.
Additionally, our fairness analysis using the Equal Opportunity Difference metric, while informative, does not encompass all possible dimensions of fairness. Other fairness metrics and considerations, such as demographic parity and individual fairness, should be explored to provide a more holistic understanding of model fairness. Due to the nature of the dataset, analyses around fairness and privacy were limited to patient sex, as race was not available. More research is needed to understand what “protected classes” mean in the context of different countries and cultures.
Lastly, while we have made significant strides in understanding the impact of embedding dimensionality reduction, the practical implementation of these models in real-world healthcare settings remains untested. Pilot studies and field deployments are necessary to evaluate the operational feasibility and effectiveness of our models in actual medical environments, particularly in low- and middle-income countries.
Next Steps
Moving forward, an essential step is to engage with healthcare professionals in resource-constrained settings. Conducting interviews and discussions with doctors and healthcare practitioners will provide valuable insights into the practical challenges they face and their specific needs, which will help tailor the model development and deployment strategies to better serve these areas.
Additionally, if these methods are going to be deployed, both improving the model's performance and understanding potential biases and problems are a priority. Exploring different fine-tuning strategies, optimizing hyperparameters, and integrating additional data sources could enhance the model's accuracy and efficiency. Utilizing saliency maps and other interpretability techniques will help identify features that are lost as embedding dimensionality decreases.
A detailed analysis of the computational and storage resources required by models with different embedding sizes is also necessary. Understanding the trade-offs between performance and resource usage will help optimize the models for deployment in resource-constrained environments. This analysis will provide a clearer picture of how to balance efficiency with effectiveness.
Lastly, during discussions with BRSET researchers, it became clear that ethical and regulatory considerations are critical when deploying AI models in healthcare settings. Issues related to laws governing the transfer of medical data and the constraints posed by internet bandwidth and data size need to be carefully addressed. Ensuring compliance with local regulations and addressing these ethical concerns is crucial for responsible AI deployment.
References
[1] Paisan Ruamviboonsuk, Richa Tiwari, Rory Sayres, Variya Nganthavee, Kornwipa Hemarat, Apinpat Kongprayoon, et al. 2022. Real-time diabetic retinopathy screening by deep learning in a multisite national screening programme: a prospective interventional cohort study. The Lancet Digital Health 4, 3 (March 2022), e158-e168. https://doi.org/10.1016/S2589-7500(22)00017-6
[2] Robert E. Colgan, David E. Gutierrez, Jugesh Sundram, and Gnana Bhaskar Tenali. 2013. Analysis of Medical Data Using Dimensionality Reduction Techniques. Technical Report. https://doi.org/10.13140/2.1.2270.1762
[3] Bram de Wilde, Anindo Saha, Richard P. G. ten Broek, and Henkjan Huisman. 2023. Medical diffusion on a budget: textual inversion for medical image generation. arXiv:2303.13430.
[4] Diarmid Williams, Henry Hornung, Ashish Nadimpalli, and Anne Peery. 2021. Deep Learning and its Application for Healthcare Delivery in Low and Middle Income Countries. Frontiers in Artificial Intelligence 4, Article 553987 (April 2021), 12 pages. https://doi.org/10.3389/frai.2021.553987
[5] Emma Beede, Elizabeth Baylor, Fred Hersch, Anna Iurchenko, Lauren Wilcox, Paisan Ruamviboonsuk, and Laura M. Vardoulakis. 2020. A Human-Centered Evaluation of a Deep Learning System Deployed in Clinics for the Detection of Diabetic Retinopathy. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI ’20), April 25-30, 2020, Honolulu, HI, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3313831.3376718
[6] Luis Filipe Nakayama, Lucas Kirsten Meneghetti, Gabriel Mauricio Ferreira Bianchini, Rogerio Gomes, Julio Cesar Nievola, and Douglas Godoy. 2023. A Brazilian Multilabel Ophthalmological Dataset (BRSET) (version 1.0.0). PhysioNet. https://doi.org/10.13026/xcxw-8198.
[7] Ary L. Goldberger, Luis A. N. Amaral, Leon Glass, Jeffrey M. Hausdorff, Plamen Ch. Ivanov, Roger G. Mark, Joseph E. Mietus, George B. Moody, Chung-Kang Peng, and H. Eugene Stanley. 2000. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 101, 23 (June 2000), e215–e220. https://doi.org/10.1161/01.CIR.101.23.e215
[8] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV].
[9] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. 2022. A ConvNet for the 2020s. arXiv:2201.03545 [cs.CV].
[10] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. 2024. DINOv2: Learning Robust Visual Features without Supervision. arXiv:2304.07193 [cs.CV].
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929 [cs.CV].
[12] Yukun Zhou, Mark A. Chia, Siegfried K. Wagner, Murat S. Ayhan, Dominic J. Williamson, Robbert R. Struyven, Timing Liu, Moucheng Xu, Mateo G. Lozano, Peter Woodward-Court, Yuka Kihara, UK Biobank Eye & Vision Consortium, Andre Altmann, Aaron Y. Lee, Eric J. Topol, Alastair K. Denniston, Daniel C. Alexander, and Pearse A. Keane. 2023. A foundation model for generalizable disease detection from retinal images. Nature 622 (June 2023), 156–163. https://doi.org/10.1038/s41586-023-06555-x
[13] Muhammad Mohsin Butt, D. N. F. Awang Iskandar, Sherif E. Abdelhamid, Ghazanfar Latif, and Runna Alghazo. 2022. Diabetic Retinopathy Detection from Fundus Images of the Eye Using Hybrid Deep Learning Features. Diagnostics 12, 7 (July 2022), 1607. https://doi.org/10.3390/diagnostics12071607
[14] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. 2024. Vision Transformers Need Registers. arXiv:2309.16588 [cs.CV].
[15] Leland McInnes, John Healy, and James Melville. 2020. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426 [stat.ML].
[16] Centers for Disease Control and Prevention. 2022. Diabetes and Vision Loss. Retrieved May 10, 2023 from https://www.cdc.gov/diabetes/managing/diabetes-vision-loss.html