Disease Detection in Human Eyes Through Machine and Deep Learning
Authors: Rijul Nanda, Evan Bausbacher, Youchan Lee, Corey Karnei, Colby Janecka
Processing of medical images has been a steadily growing field of study in the past decade. Many researchers, data scientists, and machine learning experts and enthusiasts have developed strong models to learn features of these images with deep learning, and have contributed to improving medical imaging techniques. In this post, models written to learn features of diseased and normal fundus images of human eyes are discussed, including accuracy rates of disease detection and future work. This disease detection is split into binary classification (one pathology versus normal), as well as multiclass models that attempt to classify an image into one of seven categories. Our binary classification (individual) models present accuracies of greater than 90% on the training set, and above 80% on the testing set. Our best multiclass model presents an 87% accuracy on the entire dataset, with an F1 score of 90%.
The eye pathologies of interest for this detection are cataracts, diabetic retinopathy, age-related macular degeneration, hypertensive retinopathy, glaucoma, and myopia.
For background on these pathologies, a brief description of how each presents is given below:
- Cataracts are a pathology that primarily affects the lens of the eye. Specifically, the lens appears more clouded than a normal lens, which in turn clouds the vision of the patient.
- Diabetic retinopathy is a pathology that presents as a complication of diabetes, the result of either a lack of insulin or insulin insensitivity. That is, the hormone responsible for glucose regulation does not function as expected. This results in elevated blood sugar in the body; in the eye, the patient presents with black spots in the field of vision due to blood vessel blockage in the retina.
- Age-related macular degeneration (AMD) is a visual disruption that affects central vision. Specifically, AMD disrupts the discernment of fine detail. AMD presents in dry and wet forms, characterized by the presence of drusen (fatty deposits) in the macula in the dry form, or by the formation of blood vessels below the macula, causing an increase of fluid in the retina, in the wet form.
- Hypertensive retinopathy, similarly to diabetic retinopathy, is a complication of a familiar pathology: in this case, hypertension. For patients presenting with this condition, blurred vision, headaches, and diplopia (double vision) are common.
- Glaucoma is commonly associated with increased pressure within the eye, which may be due to fluid build-up. This pressure damages the optic nerve and reduces the ability of the patient to view objects in their periphery.
- Myopia is commonly known as near-sightedness, and is an extremely common refractive error of the eye. This pathology is caused by light being focused in front of the retina, rather than directly on it, and is frequently corrected with eyeglasses.
The most significant portion of data was sourced from the ODIR-5K dataset, available here: https://www.kaggle.com/andrewmvd/ocular-disease-recognition-odir5k
An immediate problem with the dataset is the lack of images for some diseases. There is a significant number of images of normal, undiseased eyes, but far fewer images representing hypertensive retinopathy, for example. To rectify this, more images were sourced to increase the size of the dataset, specifically for the underrepresented diseases.
The sources of the images used to augment the dataset are the following:
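A quick way to spot this kind of imbalance is to count the labels before training. The sketch below is only illustrative: the labels, the counts, and the 10% threshold are made up for the example and are not the actual ODIR-5K class counts.

```python
from collections import Counter

# Hypothetical labels for illustration only; NOT the real ODIR-5K counts.
labels = ["normal"] * 50 + ["cataract"] * 10 + ["hypertensive_retinopathy"] * 4


def class_counts(labels):
    """Return a mapping from class name to number of images."""
    return Counter(labels)


def underrepresented(labels, threshold=0.1):
    """Return classes whose share of the dataset is below `threshold`."""
    counts = class_counts(labels)
    total = sum(counts.values())
    return sorted(name for name, n in counts.items() if n / total < threshold)


counts = class_counts(labels)
print(counts.most_common())      # classes ordered from most to least frequent
print(underrepresented(labels))  # classes that may need extra images
```

Classes flagged this way are the natural candidates for sourcing additional images.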
Additional data for glaucoma: https://www.kaggle.com/sshikamaru/glaucoma-detection/version/2
Additional data for diabetic retinopathy: https://www.kaggle.com/c/diabetic-retinopathy-detection/data
Additional data for hypertension: https://retinagallery.com/thumbnails.php?album=731
Following the augmentation of the dataset, the first attempt at deep learning was made. The first approach was to perform classification of each disease pathology in comparison to normal fundus images. That is, the dataset was split into individual diseases, including a category for normal fundus images, and deep learning on the subsets was performed to classify images as diseased or normal.
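The disease-versus-normal split described above can be sketched as a simple filter over labeled image paths. The file names, label strings, and the helper name `make_binary_subset` are hypothetical; the original data-loading code is not shown in this post.

```python
def make_binary_subset(samples, disease, normal_label="normal"):
    """Keep only images of `disease` or normal eyes and relabel them 1 / 0."""
    subset = []
    for path, label in samples:
        if label == disease:
            subset.append((path, 1))  # diseased
        elif label == normal_label:
            subset.append((path, 0))  # normal
        # images of every other pathology are left out of this subset
    return subset


# Hypothetical file names and labels, for illustration only.
samples = [
    ("img_001.jpg", "normal"),
    ("img_002.jpg", "cataract"),
    ("img_003.jpg", "glaucoma"),
    ("img_004.jpg", "normal"),
]
cataract_subset = make_binary_subset(samples, "cataract")
print(cataract_subset)
```

Repeating the call once per pathology would give one binary dataset per individual model.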
These models are referred to as individual models, as they compared and classified images of one disease to images of normal eyes. Classification was performed with convolutional neural networks (CNNs). CNNs are multilayered neural networks that perform convolutions in at least one step in the network. CNNs may be pre-trained on certain image sets, or may be manually trained.
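To make the convolution step concrete, here is a minimal pure-Python sketch of the sliding-window operation a convolutional layer applies (no padding, stride 1). Like most deep learning libraries, it does not flip the kernel. The toy image and kernel are invented for the example; real layers learn their kernel values during training.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 "image" with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# 2x2 kernel that responds to left-to-right increases in intensity.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The resulting feature map is nonzero only where the window straddles the edge, which is the sense in which a convolution "detects" a local feature.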
For this task, 3 CNN models were utilized: ResNet50, VGG19, and AlexNet. VGG19 is the only completely pre-trained model, while ResNet50 features a mixture of pre-trained layers and trainable layers, and AlexNet is fully manually trained.
In the following section, the individual models are described for each disease.
Individual Models
Since the data was taken from Kaggle, many users have submitted notebooks displaying their work in disease detection for this dataset. Most of these notebooks are concerned with detecting cataracts in comparison to normal images. Our model for cataract detection nearly matches the accuracy of the best performing cataract model: the best performing model has a 100% training accuracy, while ours has a 99.04% training accuracy. All of our other models surpass the best performing models available in terms of training accuracy, testing accuracy, or both.
Cataracts
The best performing model on this subset was the pre-trained VGG19 model. The other model run was AlexNet, which presented slightly weaker results.
AlexNet peaked at an accuracy of around 80% overall after training for 15 epochs. Below is the architecture of AlexNet utilized for all individual models, with the first test run on the cataracts subset (note that this code was edited from a tutorial to fit our needs. The tutorial followed is here: https://engmrk.com/alexnet-implementation-using-keras/):
Below are the results of ten random samples of AlexNet predictions:
For VGG19, our accuracy was much higher than the other model, AlexNet, with a peak training accuracy of 99.04% and peak testing accuracy of 94.49%. A random sample of predictions from VGG19 is displayed below:
In our AlexNet models, we display one overall accuracy, while VGG19 indicates both the training and testing accuracy.
The accuracy of AlexNet is not very satisfactory, as models run by others achieved accuracies of at least 90%. VGG19, however, performed much better and approached perfection in its predictions. The VGG19 model’s performance matches the results of other models found online, but does not surpass them. Improvements to this model could be made in the dense layers of the CNN by decreasing the size of the layer, as well as potentially modifying the batch size of each training step.
Diabetic Retinopathy
The best performing model was the pre-trained VGG19 model. The other model trained on this set was AlexNet, which performed significantly worse.
Training for diabetic retinopathy (DR) was completed in a similar manner to cataracts, as discussed above. That is, AlexNet was first run on the subset of data to establish a possible baseline, and to hopefully improve our manually written model. The ten sample images below, displaying the predictions from this model, capture the basic performance of AlexNet on DR.
As noted before, only the overall accuracy was recorded for AlexNet, and it again did not prove to be a strong model for classifying images of this specific disease. AlexNet did little better than random guessing, since 50% is the expected accuracy when guessing at random between two categories. The code, along with the accuracies of the last two epochs, is displayed below:
To improve our model for DR, another VGG19 was written for this image set. The model was pre-trained on ImageNet. A random sample of ten images gives an idea of how VGG19 performed for DR.
As seen from the above, VGG19 outperformed AlexNet by a huge margin, as it peaked at 94.48% in its training accuracy and 81.97% in its testing accuracy. In comparison to others who have run this on VGG19, our VGG19 performed 0.95% better on its testing accuracy and 16.6% better on its validation accuracy. Overall, this performance indicates a strong classification for the DR data set, as accuracies surpass other models written, while meeting the pre-established target of an accuracy above 80%. This model may be improved in future work by tuning the test/train split of the data, as well as attempting to train for longer. Additionally, different batch sizes, which increase or decrease the number of steps taken per epoch, may improve results.
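The link between batch size and steps per epoch mentioned above is simple arithmetic: each epoch needs enough batches to cover every training image once. The training-set size below is a hypothetical number chosen for the example, not the size of the DR subset.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


# Hypothetical training-set size, for illustration only.
num_train = 1000
for batch_size in (16, 32, 64):
    print(batch_size, steps_per_epoch(num_train, batch_size))
```

Halving the batch size roughly doubles the number of weight updates per epoch, which is why it is worth treating as a tuning parameter.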
Age-Related Macular Degeneration
The best performing model was a mostly pre-trained ResNet50. We also attempted to train utilizing a completely pre-trained VGG19. Both CNNs include pre-training on ImageNet.
For age-related macular degeneration (AMD), VGG19 was first utilized as it proved to be the superior model for cataracts and DR. The ten sample pictures below display how VGG19 performed for AMD:
The accuracy results from training VGG19 on this dataset, which was performed in a similar manner to the screenshots provided for DR and cataracts, are displayed below:
As the screenshots above indicate, VGG19 proves to be highly unreliable at distinguishing AMD from normal fundus images. Both its training and validation accuracies peaked at around the 50% mark, which is no improvement upon random guessing. In fact, the validation accuracy shows that random guessing would, on average, do better than VGG19 on this subset of data.
For an improvement of AMD classification, a ResNet50 model was written to train on this subset. The randomly sampled images below showcase the performance of ResNet50, as well as a screenshot of accuracy results:
The screenshots indicate an extremely accurate classification model through ResNet50 as the peak training accuracy was 95.33% and the peak testing accuracy was 89.60%. In comparison to VGG19, it performed much better and could be closer to perfect if trained over a larger number of epochs. Additionally, tuning the model on learning rate may also improve results. These results overall indicate strong classification on this subset of data, and thus a strong model overall with the partially pre-trained ResNet50 model.
Hypertensive Retinopathy
The best performing model was the partially pre-trained ResNet50. The other model run was the completely pre-trained VGG19.
VGG19 was first run on the subset for hypertensive retinopathy (HR). The training accuracy almost reached perfection, while the testing accuracy peaked at 66.45%. This indicates potential overfitting to the dataset. The sample of ten below shows the general performance of VGG19 on HR:
The accuracy results of the last two epochs are displayed below:
Due to this overfitting, a different CNN was utilized in the hopes that the error would be corrected. Thus, a ResNet50 model was utilized. This ResNet50 model was also pre-trained on ImageNet, but also featured some layers that were not pre-trained. Note that the layers of choice that were not pre-trained were initially determined through testing. Different numbers of layers from the end were set as trainable, and the model was run for 15 epochs. Testing ended once accuracy stopped improving. A random sample of the best ResNet50 model’s predictions is displayed below:
A screenshot of results for the last two epochs is displayed below:
While the table above indicates a slightly worse training accuracy, the ResNet50 model performed much better than the VGG19 model in not overfitting as the testing accuracy was better by 27.27%. In comparison to other models trained on this image set, our ResNet50 model fared much better with a peak training accuracy that was around 4.16% worse while our peak testing accuracy was around 16.8% better. This suggests that our ResNet50 did not overfit nearly as much, and is thus an improvement upon existing models.
Glaucoma
The best performing model was the mostly pre-trained ResNet50. The other model run was a manually implemented AlexNet.
To do glaucoma classification, ResNet50 was first utilized. ImageNet was once again the dataset used for pre-training ResNet50, and the same ResNet50 model trained for hypertensive retinopathy was repeated for this subset of the data. Ten random sample images of the model’s predictions below showcase its general performance:
As the screenshot above indicates, ResNet50 had a training accuracy around 90% and a testing accuracy around 80%. The F1 score on the validation set, however, indicates poor performance on precision and recall. This was potentially not the best model that could be run, as the previous individual models performed with somewhat better success. In search of an improvement, AlexNet was then run on the glaucoma set, and the ten images show its varying degrees of success:
As the screenshot above shows, AlexNet performed much worse than ResNet50, as its overall peak accuracy was 65.71%. AlexNet once again struggled to classify these images to a reliable degree. Thus, the ResNet50 model performed best on this subset of data. For context, when comparing our ResNet50 model to other models, our model performed worse by 3.16% on its training accuracy, but better by 1.95% on its testing accuracy, indicating that the ResNet50 model was slightly better than the existing models. We consider our model better because it improves upon the testing accuracy of the best model written on this data. To improve this model further, a better test/train split may introduce less error in the F1 score, and a longer training time through a larger number of epochs may improve accuracy.
Myopia
The best performing model was the pre-trained VGG19 model. The other model run was one final attempt at AlexNet.
AlexNet was run in hopes of finally achieving strong results. A random set of ten predictions below once again displays the unreliable state of AlexNet in the classification of our diseases:
The results above are once again unsatisfactory. The screenshot above indicates accuracies that are only slightly better than the expected result from guessing. This clearly showcases that AlexNet was not very successful in classifying the myopia set, just as it had fared poorly on all other individual models. To ideally improve upon these accuracy rates, a pre-trained VGG19 model was written. A sample of the resulting images from the predictions is displayed below:
The accuracies from the last two epochs are displayed below:
In comparison to the best model found online for this specific disease, our peak training accuracy was 3.33% better while our peak testing accuracy was slightly worse by 2.54%. These results suggest that while our VGG19’s classification was slightly better on the training set, it did not classify as well on the testing set. This model could be improved by training for longer, as well as by improving the test/train split.
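As a point of reference for tuning the test/train split, here is a minimal sketch of a shuffled hold-out split. The function name, the default 20% test fraction, and the fixed seed are assumptions for the example; the original splitting code is not shown in this post.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and split off `test_fraction` for testing."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Integer ids stand in for image paths here.
items = list(range(100))
train, test = train_test_split(items, test_fraction=0.2)
print(len(train), len(test))
```

With imbalanced classes, a stratified split that preserves class proportions in both parts is often preferable; this sketch does not do that.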
Individual Model Takeaways
Overall, the individual models performed quite strongly. In comparison to others who have performed image classification on this dataset, both the training accuracies and testing accuracies provided here either nearly match or exceed the accuracies of the existing best models, as discussed above. Improvements upon these results could be made by training for longer, and also by further tuning these models for learning rates and test/train splits. Though many of the models feature a callback that would reduce the learning rate after monitoring the validation accuracy, starting from a slightly lower learning rate may improve results overall. Additionally, a test/train split of greater than 20% for testing may improve results as well.
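The learning-rate callback mentioned above can be sketched in plain Python. This mirrors the general idea behind reduce-on-plateau callbacks such as Keras's `ReduceLROnPlateau`, but it is a simplified stand-in: the `factor`, `patience`, and `min_lr` values and the accuracy history are assumptions, not the settings used in our models.

```python
def reduce_lr_on_plateau(val_accuracies, initial_lr, factor=0.5, patience=2, min_lr=1e-6):
    """Return the learning rate in effect during each epoch.

    The rate is multiplied by `factor` whenever validation accuracy has not
    improved for `patience` consecutive epochs.
    """
    lr = initial_lr
    best = float("-inf")
    wait = 0
    schedule = []
    for acc in val_accuracies:
        schedule.append(lr)  # rate used while training this epoch
        if acc > best:
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
    return schedule


# Hypothetical validation accuracies, for illustration only.
history = [0.60, 0.70, 0.70, 0.69, 0.75, 0.74]
schedule = reduce_lr_on_plateau(history, initial_lr=0.001)
print(schedule)
```

Starting from a lower `initial_lr` shifts the whole schedule down, which is the adjustment suggested above.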
These models indicate that CNNs perform individual classifications with a high degree of accuracy. Further, pre-trained models present better results than manually trained models, while also decreasing the amount of time spent running the model. That is, even without GPU support, pre-trained models were an improvement on time spent waiting for models to train. With GPU support, pre-trained models were even faster than before.
With these individual models, our hope is to allow medical practitioners to confirm diagnoses after they have been made. For example, after identifying a possible case of cataracts during a fundoscopic exam, a doctor could run the fundus image through our accurate VGG19 model to confirm the result. The next steps would be to train these models for a larger number of epochs, tune the test/train split on the data, and tune the parameters of these models, with the goal of achieving 100% accuracy for all models. These results are not inconceivable, and would be ideal when utilizing deep learning in patient care.
Multiclass Models
The next class of models written are multiclass models. These models perform predictions on the dataset as a whole, rather than individual classification.
When initially performing this classification, the accuracy rate achieved peaked around 30–40%. Though this accuracy is not poor, as it is a significant improvement upon random guessing (which would yield an expected accuracy of 14%), this accuracy reflects possible issues with the dataset as a whole.
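The random-guessing baselines used throughout this post follow from a one-line calculation: guessing uniformly among k classes is correct with probability 1/k, regardless of how the classes are balanced. A small sketch (the helper name is ours):

```python
from fractions import Fraction


def uniform_guess_accuracy(num_classes):
    """Expected accuracy when guessing uniformly among `num_classes` classes."""
    return Fraction(1, num_classes)


binary_baseline = float(uniform_guess_accuracy(2))      # one disease vs. normal
multiclass_baseline = float(uniform_guess_accuracy(7))  # six pathologies + normal
print(f"{binary_baseline:.1%}, {multiclass_baseline:.1%}")
```

Note that on an imbalanced dataset, always predicting the most common class can score higher than this, so the majority-class share is a stricter baseline to compare against.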
The first attempt to correct errors in the dataset was to inspect the images. As noticed, the images were not all of the same size:
This presents a problem when training on the entire dataset: since the dataset is large, with several different categories of images, having images of different shapes will introduce error in the training. The different sizes may stem from the images added from different online sources, or from the original images themselves. Overall, these images may be noisy and thus may be a cause of lower accuracy than desired.
To standardize the dataset, a script was written to crop images to the same size. By performing this cropping, we ensure that the model trains on images of the same dimensions. Though this does not ensure the images are all of the same quality or resolution, the dimensions of the images are standardized to decrease noise in this regard. With this new cropped dataset, the multiclass models were rerun, and the results are displayed in the section below.
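The cropping script itself is not reproduced here. As a rough sketch of the idea, the code below computes a centered crop, which is an assumption on our part: the actual script may have chosen the crop region differently. Images are represented as plain nested lists so the example has no dependencies.

```python
def center_crop_box(width, height, target_w, target_h):
    """Return (left, top, right, bottom) of a centered crop of the target size."""
    if target_w > width or target_h > height:
        raise ValueError("target size must not exceed the image size")
    left = (width - target_w) // 2
    top = (height - target_h) // 2
    return left, top, left + target_w, top + target_h


def center_crop(pixels, target_w, target_h):
    """Crop a row-major 2D list of pixels to the target size around its center."""
    height, width = len(pixels), len(pixels[0])
    left, top, right, bottom = center_crop_box(width, height, target_w, target_h)
    return [row[left:right] for row in pixels[top:bottom]]


# Toy 4x6 "image" whose pixel values encode their (row, column) position.
pixels = [[10 * r + c for c in range(6)] for r in range(4)]
cropped = center_crop(pixels, target_w=2, target_h=2)
print(cropped)
```

The `(left, top, right, bottom)` tuple uses the same ordering as the box argument that Pillow's `Image.crop` expects, so the box helper could be reused with real image files.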
To put the results displayed below in context, we compare against multiclass models written for other datasets of fundus images. We found no multiclass models written on this dataset, so this multiclass attempt is unique; however, comparisons can be made, with caution, to models on other fundus image datasets. These models boast accuracies of 80% or better.
The major differences between our models and the models trained on other similar datasets include size, as our dataset is significantly larger; number of classes, as our dataset features seven classes while others may feature fewer; and range of pathologies, as other models were written to train on different presentations of the same pathology, such as severity of glaucoma.
The models utilized here are XGBClassifier, AlexNet, ResNet50, and RFClassifier. XGBClassifier and RFClassifier are standard machine learning algorithms that do not employ the same CNN architecture as the others do, and are not deep learning algorithms.
AlexNet
The AlexNet multiclass model was written to train on all seven classes of images, which includes all of the pathologies and normal. The model was run for 30 epochs, and reached a final accuracy of 46.44%. A random sample of the images output by this model is displayed below:
An accuracy rate of 46.44% presents a significant improvement upon random guessing; however, this accuracy leaves room for desirable improvements. Though our dataset is unique, we aim to improve our model to reach similar or better accuracies than presented by others. Additionally, since AlexNet is a completely manually trained model, we expected an improvement by utilizing pre-trained models. Thus, our next model was a mostly pre-trained ResNet50 model.
ResNet50
ResNet50 features a mixture of pre-trained and trainable layers. Following a tutorial (https://bit.ly/2VRYK5g), we trained the last 22 layers of ResNet50, while all others were pre-trained on ImageNet. A sample of the resulting images is displayed below:
The accuracy rates of 83.34% on the training set and 72.82% on the testing set display a significant improvement on our previous AlexNet model. Note that the screenshot below represents the results of the ResNet50 model after 100 epochs had been run and saved. The saved model was loaded and rerun for 20 more epochs, resulting in the highest accuracy achieved yet. This improvement may be attributed to multiple factors, including a larger amount of time spent training, and utilizing the ImageNet database for pre-training.
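The partial fine-tuning used for this ResNet50 model, where only the last 22 layers are trained, comes down to marking which layers are trainable. The sketch below shows just that bookkeeping in plain Python; the total layer count is a placeholder, not the real depth of ResNet50. In Keras, flags like these would typically be applied by setting each layer's `trainable` attribute while looping over `model.layers`.

```python
def trainable_flags(num_layers, num_trainable):
    """Mark only the last `num_trainable` of `num_layers` layers as trainable."""
    if not 0 <= num_trainable <= num_layers:
        raise ValueError("num_trainable must be between 0 and num_layers")
    first_trainable = num_layers - num_trainable
    return [index >= first_trainable for index in range(num_layers)]


# The total layer count here is a placeholder, not the real ResNet50 depth.
flags = trainable_flags(num_layers=100, num_trainable=22)
print(sum(flags), "trainable;", len(flags) - sum(flags), "frozen")
```

Frozen layers keep their ImageNet weights, while the trainable tail adapts to fundus images.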
Despite this improvement, our aim continued to be to achieve an accuracy rate that significantly surpasses 80%. This goal was set to outperform other multiclass CNNs that have been written by researchers and others on Kaggle. Due to the fact that CNNs require a great deal of power and time to train, our aim shifted to standard machine learning classification models rather than deep learning CNNs.
XGBClassifier
XGBClassifier is a simple classifier that can be run with user-introduced parameters as a result of tuning, or with default parameters. For this dataset, a simple XGBClassifier was implemented to determine an initial value of accuracy:
The accuracy rate of 50.62% is not ideal. Our aim with classifiers was to reduce training time, but also to increase accuracy. Once again, though this accuracy is an improvement upon random guessing, the results are not an improvement upon the CNNs previously utilized, and not what is desired. As such, since the initial accuracy without any tuning was lower than desired, a different model was written rather than attempting to tune this model.
RFClassifier
RFClassifier is a machine learning algorithm that implements the Random Forests architecture. Random Forests combine the results of many decision trees to improve prediction accuracy. This model proved to be our best model. The accuracy achieved was 87%, a significant improvement upon all previous multiclass models.
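To illustrate how a forest combines its trees, the sketch below averages per-tree class probabilities and picks the class with the highest average. The tree outputs are invented for the example, and the helper names are ours. This is one common way of combining trees, not necessarily what RFClassifier does internally.

```python
def average_probabilities(per_tree_probs):
    """Average class-probability dicts produced by several trees."""
    classes = sorted({c for probs in per_tree_probs for c in probs})
    n_trees = len(per_tree_probs)
    return {c: sum(p.get(c, 0.0) for p in per_tree_probs) / n_trees for c in classes}


def forest_predict(per_tree_probs):
    """Predict the class with the highest averaged probability."""
    averaged = average_probabilities(per_tree_probs)
    return max(averaged, key=averaged.get)


# Hypothetical outputs of three trees for a single image.
per_tree_probs = [
    {"glaucoma": 0.6, "normal": 0.4},
    {"glaucoma": 0.3, "normal": 0.7},
    {"glaucoma": 0.8, "normal": 0.2},
]
averaged = average_probabilities(per_tree_probs)
prediction = forest_predict(per_tree_probs)
print(averaged, prediction)
```

Because each tree sees a different random sample of the data and features, averaging tends to cancel out the errors of individual trees.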
A sample of images generated from predictions of the RFClassifier is shown below:
The screenshot below displays the result of the accuracy and F1 score after running this model:
To further validate these results, an F1 score was calculated. The F1 score is the harmonic mean of precision and recall, so it reflects the effects of false positives and false negatives on the classification, which an accuracy measurement alone does not separate out. The F1 score achieved was 90%, indicating that the results were in fact strong with this model. This was the strongest of all the multiclass models; tuning some of its parameters did not improve the results further.
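For readers unfamiliar with the metric, here is a minimal sketch of how accuracy and F1 are computed for a single positive class. The labels are made up for the example. For a multiclass problem, per-class scores like these have to be averaged in some way, and this post does not state which averaging was used.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical labels: 1 = diseased, 0 = normal.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy(y_true, y_pred), precision, recall, f1)
```

In this toy case one false positive and one false negative pull F1 below accuracy, which shows how the two metrics can diverge.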
Final Takeaways
The RFClassifier presented the strongest results on classification, as determined by its high accuracy. Others who have performed multiclass classification on similar datasets have achieved accuracies above 80%, and some have reached 100% accuracy; however, these models were written for smaller datasets, for classification of different presentations of the same disease, or for a smaller number of classes overall. Our results are thus strong for the entire dataset, as an 87% accuracy on nearly 8000 different images presents a significant improvement upon random guessing, as well as a significant improvement upon all of our previous multiclass models.
With the results from our modeling, future work could involve developing scripts and potentially phone and computer applications that doctors could utilize to quickly confirm or make diagnoses on fundus images. As previously mentioned, the individual models can be utilized to confirm diagnoses that have already been made, while the multiclass models can perform the diagnoses themselves. To do this, the models would need to be saved and scripted for ease of use, and to ensure that training would not be necessary before each use. This modeling thus has significant implications for the improvement of patient care in the future, as deep learning techniques that learn features of fundus images can perform classifications to aid critical decision making.
References
When referencing models written by others, we are referencing the following sources:
- https://www.kaggle.com/taha07/cataract-prediction-using-vgg19
- This notebook provided our initial inspiration to attempt classification with VGG19, and also served as a reference accuracy to compare against after writing our VGG19 model.
- https://www.kaggle.com/mateuszbagiski/odir5k-predicting-from-extracted-features
- This notebook features accuracy scores for many different individual models. Since this is the only notebook attempting individual models that was found on Kaggle, it served as a reference against which to compare and contextualize the accuracies of many of our individual models.
- http://www.ijstr.org/final-print/mar2020/Classification-Of-Retinal-Fundus-Images-Based-On-Alexnet-And-Transfer-Learning.pdf
- This paper presents the results of writing an AlexNet model to learn the traits of 1000 fundus images in many different categories. This paper provided a reference for our multiclass models, as the peak accuracy achieved for this dataset was 87.5%. Our best multiclass model achieved a comparable accuracy of 87%.
- http://journal.uad.ac.id/index.php/TELKOMNIKA/article/download/14868/8066
- This paper presents the results of training on a large dataset of different presentations of diabetic retinopathy only. The accuracy achieved here was between 80–100%, though the dataset featured only fundus images of different levels of presentation of diabetic retinopathy. Despite this, our multiclass models predicting on a different dataset of several different disease classifications achieved similar accuracy rates to this research. With the results presented here, it is clear that improvements can be made to our model to further increase accuracy to conceivably achieve 100% testing accuracy.
Other sources of inspiration, including tutorials followed for ResNet50 and VGG19, include the following:
- https://towardsdatascience.com/transfer-learning-in-tensorflow-9e4f7eae3bb4
- This tutorial aided in the formation of our initial VGG19 model.
- https://engmrk.com/alexnet-implementation-using-keras/
- This tutorial explained how to implement AlexNet as a model.
- https://bit.ly/2VRYK5g
- This tutorial explained ResNet50, and its implementation as a partially pre-trained CNN.
Other sources include Kaggle notebooks, as referenced above, that were utilized as a reference to compare our results to. Additionally, these notebooks provided inspiration as to which CNN architectures to utilize for this task, specifically VGG19.