Image Recognition: Retinopathy

Image Recognition

Almost every major company now has a vested interest in image recognition algorithms: Tesla for autonomous driving, Amazon for product and price comparison, Google for image search, and of course Facebook for facial recognition.

The main purpose of this class of algorithms is to extract useful information from images automatically. Common examples include image sorting and identification; security using biometrics; visual geolocation; and analysis in fields ranging from handwriting to astronomy to sports.

Convolutional Neural Networks

Convolutional neural networks (ConvNets, also CNNs) are a general class of algorithms that have been used to great effect in image recognition. CNNs are also applied to other kinds of data: 1D signals such as speech and language, 2D images or audio spectrograms, and 3D video or volumetric images.

For this entry, we’ll discuss a particular application in the medical field: detecting diabetic retinopathy.

Image Recognition: Retinopathy

Diabetic retinopathy is the leading cause of blindness in the working-age population of the developed world and affects an estimated 93 million people worldwide.

The progression of diabetic retinopathy can be slowed if it is detected in time, which requires comprehensive screening of retinal images for vascular abnormalities. An automated detection system based on machine learning algorithms could save tens of thousands of human-hours, reducing cost and time and potentially saving the sight of millions of people.

Each eye of the patient is assigned a level between 0 and 4 inclusive (with 0 being no risk, and 4 severe risk). The goal is to predict the level that a particular eye belongs to.

Image preprocessing

Images from the dataset have very different resolutions, aspect ratios, colours, rotations, etc. A few examples from the dataset we are given:

Some samples of the images from the Kaggle dataset.

Neural networks typically require a fixed input size, so a preprocessing step is required to make all images uniform. We can use a simple program such as ImageMagick to convert everything to one colour and one size. Throwing out colour information saves processing time, although reintroducing it might improve accuracy.
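To make the preprocessing step concrete, here is a minimal sketch in Python with NumPy rather than ImageMagick. The function names, the 128-pixel target size and the luma weights are assumptions chosen for illustration, not the settings used for the results in this entry. Note that squashing every image to a square ignores the original aspect ratio.

```python
import numpy as np

def to_grayscale(rgb):
    """Collapse an (H, W, 3) RGB array to (H, W) using Rec. 601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb.astype(np.float64) @ weights

def resize_nearest(img, size):
    """Resize a 2D array to (size, size) by nearest-neighbour sampling."""
    h, w = img.shape
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return img[rows[:, None], cols]

def preprocess(rgb, size=128):
    """Grayscale, resize to a fixed square, and scale 8-bit values to about [0, 1]."""
    gray = to_grayscale(rgb)
    return resize_nearest(gray, size) / 255.0
```

Every image then has the same shape regardless of its original resolution, for example `preprocess(image_array, size=128).shape` is `(128, 128)`.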

Data Augmentation

Data augmentation is useful to avoid overfitting, a common problem in neural networks used for image recognition. For example, suppose two cameras were used to take the images, and the one that takes darker images just happened to capture more level 4s than the other. The model would then fit to this unwanted characteristic and predict dark images to be level 4 at a higher rate than is accurate.

A simple remedy is to take an image and zoom in or out, rotate, flip, adjust the brightness, etc. This produces additional data points that can guard against overfitting. The amount of data augmentation desired depends on processing constraints and other optimisation considerations.

An example of a rotation by 40 degrees, with a zoom of 1.05 is shown below:

Data processing on retinal images.
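The transformations described above (rotation, zoom, flips and brightness changes) could be sketched in NumPy as follows. This is an illustrative sketch only: the function names and the random ranges are assumptions, and nearest-neighbour sampling is used for brevity where a real pipeline would likely interpolate.

```python
import numpy as np

def rotate_zoom(img, angle_deg, zoom=1.0):
    """Rotate a 2D image about its centre and zoom, with nearest-neighbour sampling."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    # Output pixel coordinates relative to the image centre.
    ys, xs = np.mgrid[0:h, 0:w]
    dy, dx = ys - cy, xs - cx
    # Inverse mapping: for each output pixel, locate the source pixel.
    src_x = (cos_t * dx + sin_t * dy) / zoom + cx
    src_y = (-sin_t * dx + cos_t * dy) / zoom + cy
    sx = np.rint(src_x).astype(int)
    sy = np.rint(src_y).astype(int)
    # Pixels that map outside the source image are left at zero (black).
    inside = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(img)
    out[inside] = img[sy[inside], sx[inside]]
    return out

def augment(img, rng):
    """Return one randomly transformed copy of a 2D image with values in [0, 1]."""
    out = rotate_zoom(img, rng.uniform(0, 360), rng.uniform(0.9, 1.1))
    if rng.random() < 0.5:
        out = out[:, ::-1]  # horizontal flip
    out = out * rng.uniform(0.8, 1.2)  # brightness change (assumed range)
    return np.clip(out, 0.0, 1.0)
```

The example in the figure corresponds to a call like `rotate_zoom(img, 40, zoom=1.05)`, while `augment(img, np.random.default_rng(0))` draws a random combination.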

Augmentation is particularly useful for underrepresented categories. Consider an extreme case where 99% of the data is category 0, and 1% of the data is category 4. In this case, an algorithm could just predict all new data to be level 0, with only a 1% misclassification rate. To fix this issue, we can create roughly equal-sized sample sets for each level of risk and run our Convolutional Neural Network (ConvNet or CNN) on these.
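One simple way to build equal-sized sample sets is to oversample the rarer levels until each matches the most common one. The sketch below uses only the Python standard library; the function name and the choice of oversampling (rather than, say, discarding level 0 images) are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def balance_by_oversampling(samples, labels, seed=0):
    """Return shuffled (sample, label) pairs with every label equally represented."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        # Draw extra samples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((sample, label) for sample in items + extra)
    rng.shuffle(balanced)
    return balanced
```

In practice the repeated samples would each be passed through a random augmentation, so that the network does not see exact copies of the same rare image.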


Now that we have processed the data to our liking, we simply feed it through a ConvNet.


The exact structure of the neural network is somewhat arbitrary; it is chosen by comparing the \(\kappa\)-statistic of candidate structures on a validation set.

Using a few convolutional and pooling layers, we can now simply feed in the training images. This can be done using ready-made modules in Theano, which output a table with the model's classification results.
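To show what a convolutional layer followed by a pooling layer actually computes, here is a minimal NumPy sketch of the forward pass for a single filter. It is not Theano code and not the network used here; the function names and the ReLU activation are assumptions for illustration, and a real network would learn many filters per layer.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """'Valid' 2D cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Weighted sum of the image patch under the kernel.
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Element-wise rectified linear activation."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling; trailing rows/columns that do not fit are dropped."""
    h, w = x.shape
    h, w = h - h % size, w - w % size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))
```

One layer pair is then `max_pool(relu(conv2d_valid(img, kernel)))`; stacking a few of these and ending with a classifier over the five levels gives the overall shape of a ConvNet.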


We can test the accuracy of our diagnosis using a weighted kappa \( \kappa \), which measures the agreement between two raters \(A\) and \(B\) (in our case, the model's predictions and the reference labels). First, an \(N \times N\) histogram matrix \(O\) is constructed, where \(N\) is the number of levels, such that \(O_{i,j}\) is the number of images that receive rating \(i\) from \(A\) and rating \(j\) from \(B\). Further, an \(N\times N\) matrix of weights \(w\) is constructed, with elements defined as

$$w_{i,j} = \frac{\left(i-j\right)^2}{\left(N-1\right)^2}$$

Finally, an \(N\times N\) matrix \(E\) is computed based on the expected agreement of raters \(A\) and \(B\) from chance alone. This is simply the outer product of the two raters' histogram vectors of ratings (i.e. \(u^T v\) for row vectors \(u\) and \(v\)), normalised such that \(E\) and \(O\) have the same sum.

In medical literature the agreement of the two observers is then measured as

$$\kappa = 1 - \frac{\sum_{i,j} w_{i,j} O_{i,j}}{\sum_{i,j} w_{i,j} E_{i,j}}$$
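The definitions of \(O\), \(w\) and \(E\) translate almost directly into NumPy. The following is a minimal sketch, assuming integer ratings in the range 0 to \(N-1\); the function name and the `n_levels` parameter are illustrative choices.

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_levels=5):
    """Weighted kappa between two sequences of integer ratings in [0, n_levels)."""
    a = np.asarray(rater_a, dtype=int)
    b = np.asarray(rater_b, dtype=int)
    # O[i, j]: number of images rated i by A and j by B.
    O = np.zeros((n_levels, n_levels))
    np.add.at(O, (a, b), 1)
    # w[i, j] = (i - j)^2 / (N - 1)^2
    idx = np.arange(n_levels)
    w = (idx[:, None] - idx[None, :]) ** 2 / (n_levels - 1) ** 2
    # E: outer product of the two rating histograms, scaled to the same sum as O.
    hist_a = O.sum(axis=1)
    hist_b = O.sum(axis=0)
    E = np.outer(hist_a, hist_b) / O.sum()
    # Undefined (division by zero) if both raters give every image the same single level.
    return 1.0 - (w * O).sum() / (w * E).sum()
```

With this definition, perfect agreement gives \(\kappa = 1\), agreement no better than chance gives values near 0, and systematic disagreement gives negative values.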

Typically, in the medical field, we interpret the scores as:

The Kappa Statistic - AJ Viera 

Using some default settings, we get a \(\kappa\) score of 0.72. There is some room for improvement with optimisations, but this brief introduction already shows the power of convolutional networks used for image processing.

An example of a high risk candidate is shown below (left). You can immediately see the differences between this and a healthy candidate identified by the algorithm (right).

(Left) Image of a high risk candidate as identified by the ConvNet. (Right) Image of a low risk candidate.


There are still quite a few cases of misclassified levels; these could be cleaned up with better preprocessing of the images.

A misclassified example looks similar to this:

Left and right retinal images for a misclassified example, classified as level 4 where the true level is 1.

The cause looks to be simple camera artifacts, seen as stripes on the right-hand side of these images. Such anomalies might be removed by using a Fourier transform to filter out some of the lower-level noise, or one-off camera artifacts like these might be identified by using information from the paired eyes of each patient.

Other potential improvements come from the clever use of a pseudo-labelling technique from the ≋Deep Sea≋ team, which used the predictions from other (ensembles of) models to regularise new models.

Final Thoughts

Image recognition software is useful in a huge variety of fields. This entry, which looked at medical data to quickly and accurately identify risk categories, provides just one simple example of the power of these machine learning techniques.

A very similar piece of software, with minimal adjustments, could easily be used for facial recognition for security purposes or even for price matching software for consumer products.