Image Recognition: Getting Value from Visual Data

In this blog entry, we'll briefly discuss how image recognition software works and how we might use it to gain insight into the rich field of machine vision. For a detailed example of this type of algorithm in action, have a look at our case study on diabetic retinopathy.

Visual Data

Almost since its inception, Facebook has used image recognition to automatically identify people in images. Such technology saves time and money, and improves user retention by increasing user satisfaction.

It is easy to see the applications of such facial-recognition algorithms in both social and security arenas. Further, to a machine, faces are no different from any other features; such methods can therefore be applied to any machine vision problem, ranging from classifying astronomical data from the Hubble telescope to the more down-to-earth venture of designing Tesla's autonomous cars.

Recently there has been huge interest in deep convolutional neural networks, which have had great success in fields as different as digit recognition for zip codes and geolocation.

Convolutional Neural Networks

Convolutional neural networks (CNNs or ConvNets) are a class of biologically inspired algorithms designed to process data that typically comes in the form of multiple arrays. Each neuron is sparsely connected, and the neurons are tiled in such a way that they respond to overlapping regions of the input. The overlapping of the neurons allows a better representation of the original image, whilst the sparse connectivity ensures that each unit is unresponsive to variations outside its local neighbourhood.

The first few stages of a CNN are typically composed of two types of layers: convolutional and pooling layers. In brief, the convolutional layers use a set of filters to detect local conjunctions of features from the previous layer, whilst the pooling layers merge similar, neighbouring features into one. As you go up the hierarchy of layers, the model learns features that are both increasingly global and increasingly invariant to position. Finally, the output of these stages is fed into a standard fully connected multilayer perceptron.


Convolutional Layers

In principle, the convolutional layer can be constructed in many ways. The typical method uses a set of filters, each a small group of pixels, that we scan across the input image, computing the similarity between the filter and the image at each position.
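
As a rough illustration of the scan described above, here is a minimal NumPy sketch. The function name `scan_filter`, the 4x4 image and the 2x2 edge filter are all made up for this example, and the similarity measure is assumed to be a sum of elementwise products with a stride of one and no padding.

```python
import numpy as np

def scan_filter(image, kernel):
    """Slide `kernel` over `image` and record the response at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Sum of elementwise products: large where the patch resembles the filter.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Hypothetical input: dark on the left, bright on the right (a vertical edge).
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# Hypothetical filter that responds to a dark-to-bright vertical edge.
kernel = np.array([[-1, 1],
                   [-1, 1]], dtype=float)

feature_map = scan_filter(image, kernel)
print(feature_map)  # every row is [0. 2. 0.]: the response peaks at the edge
```

Real libraries vectorise this loop, but the explicit version makes it clear that each output value depends only on a small local patch of the input.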

A few steps of a simple example, in which a particular filter (red) is scanned over the input image (grey), are shown below [Leow Wee Kheng]. The filters are learnt iteratively using backpropagation.


Notice that the scan typically starts beyond the bounds of the input image. This is known as zero padding, and it is useful for keeping the feature map constructed by each filter the same size as the original input image.
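
To see why zero padding preserves the size, the sketch below pads a small image and compares the feature-map side length with and without padding. It assumes a stride of one and an odd filter width; the helper `feature_map_size` and the 5x5 input are hypothetical.

```python
import numpy as np

def feature_map_size(n, f, pad):
    """Side length of the output for an n x n input, f x f filter, stride 1."""
    return n + 2 * pad - f + 1

image = np.arange(25, dtype=float).reshape(5, 5)  # hypothetical 5x5 input
f = 3                                             # hypothetical 3x3 filter
pad = (f - 1) // 2                                # preserves the size when f is odd

# Surround the image with a border of zeros, `pad` pixels wide.
padded = np.pad(image, pad_width=pad, mode="constant", constant_values=0)

print(padded.shape)                   # (7, 7)
print(feature_map_size(5, f, 0))      # 3: the feature map shrinks without padding
print(feature_map_size(5, f, pad))    # 5: same size as the input with padding
```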

The final result will be a (convolution) matrix that can be visualized as:


This process is then repeated for all filters. For a face-recognition task such as Facebook's, the filters at a particular level will look like those below. They correspond to common features in the images we are scanning across.

Learning Hierarchical Representations for Face Verification with Convolutional Deep Belief Networks, Gary B. Huang, Honglak Lee, and Erik Learned-Miller, CVPR 2012

The number of filters per layer is set by hand, whereas the filters themselves are trained by the usual backpropagation methods.

These convolutional layers are typically followed by "pooling" layers.

Pooling Layers

Subsampling, nowadays typically known as pooling, is essentially a form of data compression that combines neighbouring pixels in each feature map. This allows us to gain some translational invariance at the cost of some information. These layers do no learning themselves. Instead, they simply take an \(N\times N\) input layer and output an \(\frac{N}{k} \times \frac{N}{k}\) layer, where each \(k\times k\) block is reduced to a single value by some function. The two most common choices are max and mean pooling, which, as their names suggest, pick out either the maximum or the mean value from each \(k\times k\) block.
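
The block reduction can be sketched in a few lines of NumPy. This example assumes non-overlapping blocks and an input side length divisible by \(k\); the function name `pool` and the 4x4 feature map are invented for illustration.

```python
import numpy as np

def pool(feature_map, k, reducer=np.max):
    """Reduce each non-overlapping k x k block of a square map to one value."""
    n = feature_map.shape[0]
    assert feature_map.shape == (n, n) and n % k == 0
    # Axes become (block row, row within block, block column, column within block).
    blocks = feature_map.reshape(n // k, k, n // k, k)
    return reducer(blocks, axis=(1, 3))

# Hypothetical 4x4 feature map.
feature_map = np.array([[1, 2, 0, 1],
                        [3, 4, 1, 0],
                        [0, 0, 5, 6],
                        [1, 1, 7, 8]], dtype=float)

max_pooled = pool(feature_map, 2, np.max)
mean_pooled = pool(feature_map, 2, np.mean)

print(max_pooled)   # [[4. 1.] [1. 8.]]
print(mean_pooled)  # [[2.5 0.5] [0.5 6.5]]
```

Note that the function has no trainable parameters, which is why a pooling layer does no learning of its own.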

After several convolutional and pooling layers, a standard fully-connected neural network is typically used:
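
A minimal sketch of this final stage is given below. Everything in it is an assumption made for illustration: the shape of the pooled feature maps, the layer sizes, the ReLU and softmax functions, and the random weights, which stand in for trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output of the last pooling layer: 8 feature maps of size 4x4.
feature_maps = rng.standard_normal((8, 4, 4))

# Flatten all feature maps into a single input vector for the dense layers.
x = feature_maps.reshape(-1)  # length 8 * 4 * 4 = 128

# Randomly initialised weights stand in for trained ones (assumed sizes).
n_hidden, n_classes = 16, 3
W1 = rng.standard_normal((n_hidden, x.size)) * 0.1
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_classes, n_hidden)) * 0.1
b2 = np.zeros(n_classes)

# One hidden layer with a ReLU non-linearity, then a softmax over the classes.
hidden = np.maximum(0.0, W1 @ x + b1)
logits = W2 @ hidden + b2
probs = np.exp(logits - logits.max())  # subtract the max for numerical stability
probs /= probs.sum()

print(probs)  # one probability per class, summing to 1
```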

The final outcome of traversing up the hierarchy of convolutional layers is that the filters become more and more global. At the bottom level, they may just be lines and dots, then eyes and noses, before finally faces emerge. This depth of representations allows the algorithm to learn the features to great effect.

Data Augmentation

As with most neural networks (especially deep ones), there is a large risk of overfitting in CNNs. To mitigate this, a commonly used technique is data augmentation. The general idea is to create pseudodata that act as small perturbations of the images we are learning from. These act as a form of regularization, stopping the algorithm from learning non-physical noise in the data. A very good example of this data-augmentation technique may be found in our diabetic retinopathy case study.
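
As a simple sketch of the idea, the function below generates perturbed copies of an image using a random horizontal flip and a small random shift. These particular transformations, the function name `augment` and the `max_shift` parameter are assumptions for this example, and whether a flip preserves the label depends on the problem.

```python
import numpy as np

def augment(image, rng, max_shift=2):
    """Return a randomly flipped and shifted copy of a 2D image."""
    out = image
    if rng.random() < 0.5:
        out = np.fliplr(out)  # mirror left to right
    # Draw a vertical and a horizontal shift in [-max_shift, max_shift].
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    # Pad with zeros, then crop a window offset by the shift.
    padded = np.pad(out, max_shift, mode="constant")
    h, w = image.shape
    top, left = max_shift + dy, max_shift + dx
    return padded[top:top + h, left:left + w]

rng = np.random.default_rng(42)
image = np.arange(36, dtype=float).reshape(6, 6)  # hypothetical 6x6 image

# Each call yields a slightly different version of the same image.
augmented = [augment(image, rng) for _ in range(4)]
for a in augmented:
    print(a.shape)  # (6, 6) each time
```

During training, such copies would be fed to the network alongside the original images, each keeping the label of the image it was derived from.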

Final Thoughts

There is huge scope for convolutional neural networks, and the general idea is not limited to learning faces. Medical image analysis (in digital pathology), autonomous cars, Amazon’s product-search app on your phone, and Google’s image search all use visual data processed in essentially the same way.