Image Recognition and Transfer Learning

- Zijian Han & Sili Wang  [SFL]


For human beings, interpreting what we see is so easy that we are hardly conscious of it. For computers, however, these tasks are very difficult to solve.

In this blog post, we briefly introduce image recognition with transfer learning. At its most fundamental, an image recognition algorithm takes an image as input and outputs a label describing it.

We will classify images from the Caltech 101 dataset with the Open Source Computer Vision (OpenCV) library.

The Data

The Caltech 101 computer vision dataset contains pictures of objects belonging to 101 categories. Examples of categories include airplanes, brains, and saxophones. Each image is roughly 300 × 200 pixels. The number of images per category ranges from 31 to 800, with most categories containing about 50.


Open Source Solutions

In our classification problem, we decided to apply a pre-trained network to avoid the huge effort and cost required to train a deep neural network. The trade-off is that training our own model might yield more accurate results, since our own dataset is likely more representative of the types of images we are trying to classify.

In particular, we used GoogLeNet (a convolutional neural network pre-trained on the ImageNet dataset) in the Caffe framework. ImageNet is one of the most comprehensive open-source image datasets, containing 1.2 million images subdivided into 1000 categories, large enough for a convolutional neural network to learn the variations across diverse images.
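As a rough illustration of this step, the sketch below loads a Caffe GoogLeNet through OpenCV's dnn module and returns the top predictions for one image. The file paths passed to `classify`, the 224 × 224 input size, and the mean values are assumptions based on common usage of the publicly released BVLC GoogLeNet, not details taken from our analysis. `top_k` and `classify` are hypothetical helper names.

```python
def top_k(probs, labels, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def classify(image_path, prototxt, caffemodel, labels_path, k=5):
    """Run one image through a Caffe GoogLeNet using OpenCV's dnn module."""
    import cv2  # imported here so top_k stays usable without OpenCV installed

    # One ImageNet class name per line (assumed file layout).
    with open(labels_path) as f:
        labels = [line.strip() for line in f]

    net = cv2.dnn.readNetFromCaffe(prototxt, caffemodel)
    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)

    # Assumed preprocessing: resize to 224x224 and subtract a BGR mean.
    blob = cv2.dnn.blobFromImage(image, 1.0, (224, 224), (104, 117, 123))
    net.setInput(blob)
    probs = net.forward().flatten()  # one probability per ImageNet class
    return top_k(probs.tolist(), labels, k)
```

Calling `classify` with the path of a watch image would return a list of five (label, probability) pairs, which is the kind of output shown in the figures.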

OpenCV result for a watch figure  


OpenCV result for an airplane figure  


Such “transfer learning” is well suited to proofs of concept and requires very little model setup, allowing a minimum viable product to be released with a very short turnaround.

To test the performance of GoogLeNet, we simplified the analysis by selecting a subset of 10 categories from the Caltech 101 dataset that had enough images for meaningful statistics: airplane, brain, car, chandelier, hawksbill, ketch, leopards, motorbikes, grand piano, and watch.

Transfer Learning

Since GoogLeNet was pretrained on the ImageNet data with a different set of class labels, we need to further process the results from OpenCV to map them onto our target classes.

For each observation, we stored the probabilities that OpenCV predicted for each ImageNet class in a matrix. For instance, for an input image from the leopard class, OpenCV gives probabilities of 65% leopard, 15% cheetah, 10% snow leopard, 6% jaguar, 4% Egyptian cat, and so on. How we use this information can be tuned in a full analysis, but here we kept only the top 5 probabilities to avoid overfitting.

This new matrix of top 5 classes and their probabilities acts as the feature set for the next layer of our classifier.
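One way to picture this feature matrix is sketched below: each image becomes a row with one column per ImageNet class, where only the five largest probabilities are kept and everything else is set to zero. This encoding is an assumption made for illustration, since other layouts (for example, storing class indices and probabilities side by side) would also fit the description. The function names are hypothetical.

```python
def top_k_features(probs, k=5):
    """Keep the k largest probabilities in place and zero out the rest."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    features = [0.0] * len(probs)
    for i in keep:
        features[i] = probs[i]
    return features


def build_feature_matrix(prob_rows, k=5):
    """Turn one probability vector per image into a matrix of top-k features."""
    return [top_k_features(row, k) for row in prob_rows]
```

With the leopard example above, the row would hold 0.65, 0.15, 0.10, 0.06, and 0.04 in the columns for leopard, cheetah, snow leopard, jaguar, and Egyptian cat, and zeros elsewhere.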

We found that the performance of the random forest classifier, measured on a holdout set, was significantly influenced by the distribution of classes in our training set. We therefore kept the classes balanced by randomly sampling 30 images without replacement from each class to construct the training set, and kept the rest as the test set. Oversampling and image data augmentation techniques would likely allow a fuller analysis to achieve substantially better results.
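The balanced split described above can be sketched with the standard library alone. The function name `balanced_split` and the fixed seed are illustrative choices, not part of our original code.

```python
import random
from collections import defaultdict


def balanced_split(labels, n_train=30, seed=0):
    """Return (train_indices, test_indices) with n_train samples per class."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        if len(indices) < n_train:
            raise ValueError(f"class {label!r} has fewer than {n_train} samples")
        # random.sample draws without replacement.
        chosen = set(rng.sample(indices, n_train))
        train.extend(i for i in indices if i in chosen)
        test.extend(i for i in indices if i not in chosen)
    return train, test
```

The training indices select the rows of the feature matrix used to fit the second-stage classifier (for example, a random forest from a library such as scikit-learn), and the test indices select the holdout rows.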

Performance Evaluation

Model evaluation is one of the most important areas in machine learning. We visualize the results to see which classes are well classified and which are not. Based on this evaluation, we can identify potential improvements to our model and retrain it in a different way.

The confusion matrix below shows the classification results for the ten classes. The x-axis runs over the predicted classes and the y-axis over the true classes. The probability increases from 0 to 1 as the colors range from green to yellow.
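A minimal sketch of how such a matrix can be computed is given below. It assumes each row (true class) is normalized to sum to 1, so that each cell is the fraction of images of that true class assigned to a given predicted class. That normalization is our reading of the 0 to 1 color scale rather than something stated explicitly.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Row-normalized confusion matrix: rows are true classes, columns predicted."""
    index = {name: i for i, name in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for true, predicted in zip(true_labels, predicted_labels):
        counts[index[true]][index[predicted]] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        # Leave rows for classes with no samples as all zeros.
        matrix.append([count / total if total else 0.0 for count in row])
    return matrix
```

The diagonal entries are then the per-class recall, which makes weakly classified classes easy to spot.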


From here, we see that the algorithm does quite well for the majority of the classes, although for images of brains, hawksbills, and leopards the model predictions are significantly ambiguous.

To understand this a little better, let us first take a look at the probability histograms.

The probability histogram (see below) shows the distribution of the predicted probability for each of the ten classes. Most of the classes are well classified with a high score (>0.5), indicating that the classifier is confident in most of its predictions.

However, the classifier is not very confident in its classification of pictures labeled “brain”, “chandelier”, or “hawksbill”.

For clarity, the probability histogram for this subset is shown on the left. For these classes (brain, hawksbill, and chandelier), the probability distribution indicates that many images whose true label is “chandelier” or “brain” are assigned a very low probability of being a “chandelier” or a “brain”.

This reflects what we find in the confusion matrix.

The simplest way to figure out why there are low-scoring images is to look at them. The example below is the OpenCV output for a misclassified image of a hawksbill. The highest predicted probability for this hawksbill is “beaver” at 54.19%. This usually happens because the images used to train GoogLeNet differ substantially from the Caltech 101 images in our analysis, so the GoogLeNet model cannot account for them.


An ensemble of several pretrained networks might rectify this situation by injecting more information about what this animal is, as would training our own deep network.
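One simple way such an ensemble could be formed, shown here only as an assumption since we did not try it, is to average the class probabilities produced by each network for the same image. This presumes that all networks share the same label set (for example, all trained on ImageNet). The function name is hypothetical.

```python
def average_probabilities(per_network_probs):
    """Average class probabilities for one image across several networks."""
    n_networks = len(per_network_probs)
    n_classes = len(per_network_probs[0])
    if any(len(probs) != n_classes for probs in per_network_probs):
        raise ValueError("all networks must predict the same set of classes")
    return [
        sum(probs[c] for probs in per_network_probs) / n_networks
        for c in range(n_classes)
    ]
```

The averaged vector could then be reduced to its top 5 entries and passed to the second classifier in the same way as the output of a single network.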


In general, the evaluation shows that our classification results using transfer learning are remarkably solid. However, there are still many improvements we could make in the future. If model accuracy is important and resources are unlimited, training your own neural network would likely be the best approach, provided you can get your hands on enough data.

In this analysis, we tried only one model (GoogLeNet); in the future, we may use other pre-trained models and compare the results. Data cleaning, augmentation, and processing would allow further improvements to these numbers.


In this analysis, we used a pretrained convolutional neural network to classify images from the Caltech 101 dataset. The network, GoogLeNet, was trained on the ImageNet dataset, which did not have all the same labels that appeared in our dataset. To overcome this inconsistency, we used transfer learning, taking the probabilities of the top predicted classes for each input image as new features for a second supervised classifier. The second classifier learned the relationship between the features and the targets (true label names from our own dataset) and gave robust predictive performance on the test dataset, with an accuracy of over 90% for most classes.

Transfer learning is highly applicable to a wide range of problems. It allows us to use freely available models to quickly, cheaply, and accurately provide solutions to otherwise very complicated problems.