Image Recognition

There are plentiful applications for image recognition algorithms that span almost every industry. The main purpose of these is to extract useful information automatically. Examples include: image searching, sorting, and identification; security using facial features or other biometrics; visual geolocation; analysis in any field from handwriting to astronomy to sports; and even autonomous vehicles. 

Covolutional neural networks (ConvNets, also CNNs) are a general class of algorithm that have been used to great effect in image recognition. Other purposes of CNNs include: 1D speech/language processing, 2D image or audio spectrograms, and 3D video or volumetric images.

Convolutional Layers

In principle the convolutional layer can be constructed in many ways. The typical method is using a set of traversing filters in the form of a small group of pixels, that we use to scan across our input image and computing the similarity of the input image to the filter.

A few steps of a simple example in the scan of the filter (red) over there the input image (grey) are shown below [Leow Wee Kheng].

The final result will be a matrix that can be visualised as:

The number of these filters is defined by hand, and can be trained by typical back-propagation methods. This will be covered in a later post. 

Notice that the scan starts from beyond the bounds of input image, this is known as zero padding and is useful in keeping the feature map constructed by each filter, the same size as the original input image.


Subsampling, typically known as pooling nowadays, is basically a form of data compression that combines neighbouring pixels in each feature map. This allows us to gain some translational invariance at the cost of information. These layers do no learning themselves, instead, they simply take the input \(N\times N\) layer and output an \(\frac{N}{k}\times \frac{N}{k}\) layer, where each \(k \times k\) block is reduced to a single value by some function; the two most common are max and mean pooling. 

After several convolutional and pooling layers, a standard fully-connected neural net is typically used. 

For an example we consider a medical application in retinopathy in our blog here