Dog Breed Classification using CNNs
A Udacity Data Science Nanodegree Capstone Project.
There are hundreds of different dog breeds and around 400 million dogs in the world. If you saw a cute dog at a park but didn't know its breed, wouldn't it be great if you could identify it from nothing more than a picture? That is exactly what this project from the Udacity Data Science Nanodegree sets out to achieve.
Project Overview
In this project I created, trained, and tested a deep learning pipeline built on Convolutional Neural Networks (CNNs) and transfer learning to classify pictures of dogs by breed. CNNs are a class of deep neural networks most commonly applied to analysing visual imagery. Transfer learning, meanwhile, is a machine learning technique that lets data scientists reuse the knowledge gained by a model previously trained on a similar task. It is a very powerful method, offering faster training and higher accuracy than training from scratch.
Business Understanding
In this project, images of both dogs and humans are provided in the initial training data. The task is to design and implement an algorithm that can detect a human or a dog in a supplied image. If a dog is detected, the algorithm should also predict its breed. If a human is detected, the algorithm should instead return the most resembling dog breed. If neither is present, the algorithm should produce an output indicating an error.
Performance Metric
The goal was to obtain a model that achieved over 60% classification accuracy, and I used classification accuracy as the metric to evaluate performance. The dataset does not suffer from highly imbalanced classes, so classification accuracy is a suitable metric for this project. More details on class balance follow in the Data Understanding section.
The model was used to predict the dog breed for every image in the test set, and the fraction of images it classified correctly was then calculated; this fraction is the model's classification accuracy. As this is a classification problem, other performance metrics such as the ROC curve (and its AUC) or a confusion matrix would also have been suitable.
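As a concrete sketch, the accuracy computation reduces to one comparison; predictions and test_targets are assumed placeholders for the model's softmax outputs and the one-hot test labels:

import numpy as np

# a minimal sketch of the metric; predictions (softmax outputs) and
# test_targets (one-hot true breeds) are assumed placeholder arrays
test_accuracy = 100 * np.mean(
    np.argmax(predictions, axis=1) == np.argmax(test_targets, axis=1))
print('Test accuracy: %.4f%%' % test_accuracy)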
Strategy
Udacity provided a skeleton strategy that supported my progress in completing this project. The model construction follows the steps below:
- Import and explore the dog and human datasets
- Create a human face detector using OpenCV
- Create a dog detector using Resnet50
- Create a CNN to classify dog breeds using Keras and evaluate model performance
- Use Transfer Learning to improve the initial CNN and evaluate model performance
- Write an algorithm to test new images
You can access the Jupyter Notebook I created in this Git repository to follow my solution in more detail. The repository also contains the other files needed to build this model.
Data Understanding
Before pre-processing the data and training, it was important to explore the dog dataset. Initial exploration showed that it contains 8,351 images of dogs, each labelled with one of 133 breeds. To build an effective dog breed classifier, the images were split into training, validation, and test sets following an approximate 8:1:1 ratio.
Meanwhile, the human dataset contains 13,233 images. The larger number of human images is likely intended to provide an unbiased and varied set of faces for the model to train on.
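For context, here is a sketch of how such a dataset can be loaded, assuming Udacity's dogImages folder layout with train, valid, and test subfolders; the load_dataset name is illustrative:

import numpy as np
from sklearn.datasets import load_files
from keras.utils import np_utils

# a sketch of the dataset loader, assuming Udacity's dogImages folder layout
def load_dataset(path):
    data = load_files(path)
    dog_files = np.array(data['filenames'])
    # one-hot encode the 133 breed labels
    dog_targets = np_utils.to_categorical(np.array(data['target']), 133)
    return dog_files, dog_targets

train_files, train_targets = load_dataset('dogImages/train')
valid_files, valid_targets = load_dataset('dogImages/valid')
test_files, test_targets = load_dataset('dogImages/test')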
Next, I wanted to understand the distribution of dog breeds in the dog dataset, so I started by exploring its statistical properties, with the results below:
The mean number of images per dog breed: 50.225564
The breed with the fewest images: Affenpinscher (26 images)
The breed with the most images: Yorkshire_terrier (77 images)
The standard deviation (dispersion) of images per breed: 11.8191997117
The coefficient of variation: 0.235322
The high average number of images per breed assured me that the dataset has a large sample size for each of the 133 breeds, and hence no obvious sampling bias. The low standard deviation and coefficient of variation supported this view as well.
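These statistics can be reproduced with a few NumPy calls; a minimal sketch, where breed_counts is an assumed array holding the image count for each of the 133 breeds:

import numpy as np

# breed_counts is an assumed array of 133 entries: images per breed
print('Mean images per breed:', breed_counts.mean())
print('Standard deviation:', breed_counts.std())
print('Coefficient of variation:', breed_counts.std() / breed_counts.mean())
print('Fewest / most images:', breed_counts.min(), '/', breed_counts.max())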
However, it was interesting to see that the number of images is not identical across breeds. A highly imbalanced class distribution could cause problems when training the CNN, so to inspect the distribution I created the bar plot in Figure 1 below, which visualises the image count for all 133 breeds.
The visualisation shows a slight class imbalance, but not one significant enough to hinder classification. The low standard deviation above, and the substantial number of images available for every breed, confirm this.
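A bar plot like Figure 1 can be produced along these lines; a sketch reusing the assumed breed_counts array, plus a dog_names list of breed labels:

import matplotlib.pyplot as plt

# a sketch of the Figure 1 bar plot, assuming breed_counts and dog_names
plt.figure(figsize=(25, 6))
plt.bar(dog_names, breed_counts)
plt.xticks(rotation=90, fontsize=6)
plt.ylabel('Number of images')
plt.title('Count of images per dog breed')
plt.tight_layout()
plt.show()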
It was also important to get a sense of the kinds of dog images provided, to check for visual biases in the dataset, e.g. whether most images were taken outdoors, or whether most are bright and clear. Printing a few random images shows a variety of backgrounds, which means the model is fed an unbiased dataset and should test accurately on many types of images; a sketch of this spot check follows.
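The spot check itself is a short loop; a sketch, assuming the train_files paths from the loading step:

import random
import matplotlib.pyplot as plt

# display a handful of random training images to check for visual bias
for i, img_path in enumerate(random.sample(list(train_files), 4)):
    plt.subplot(1, 4, i + 1)
    plt.imshow(plt.imread(img_path))
    plt.axis('off')
plt.show()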
Creating a Human Face Detector
To detect human faces in images, I used OpenCV's implementation of Haar feature-based cascade classifiers. This is a machine learning approach in which a cascade function is trained on a large number of positive and negative images and then used to detect objects in new images. OpenCV provides plenty of pre-trained face detectors. Below, I instantiate the Haar cascade classifier from OpenCV and create a face detector function that predicts whether an input image contains a human face:
import cv2

# extract pre-trained face detector
face_cascade = cv2.CascadeClassifier('haarcascades/haarcascade_frontalface_alt.xml')

# returns "True" if face is detected in image stored at img_path
def face_detector(img_path):
    img = cv2.imread(img_path)
    # the cascade classifier expects a greyscale image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray)
    return len(faces) > 0
The performance of this function was tested by answering two questions:
- What percentage of the first 100 images in the human images dataset have a detected human face?
- What percentage of the first 100 images in the dog images dataset have a detected human face?
Ideally, the function would detect a face in 100% of the human images and 0% of the dog images. It fell short of that goal but still performed acceptably: it detected a face in 100% of the human images, but also in 11% of the dog images.
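These percentages come from running the detector over the first 100 files of each set; a sketch, assuming human_files and dog_files are the arrays of image paths loaded earlier:

import numpy as np

# fraction of the first 100 images in each set that triggers the face detector
human_rate = np.mean([face_detector(f) for f in human_files[:100]])
dog_rate = np.mean([face_detector(f) for f in dog_files[:100]])
print('Faces detected in {:.0%} of human images'.format(human_rate))
print('Faces detected in {:.0%} of dog images'.format(dog_rate))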
Data Pre-processing
To create a dog detector, I used a pre-trained ResNet-50 network with weights trained on ImageNet. Some data pre-processing was required to use this model effectively with the provided images.
Keras requires the images as 4D tensors with shape (number of samples, image height in pixels, image width in pixels, number of colour channels). I created a function that loads an image, resizes it to a 224 × 224 pixel square, converts it to an array, and expands it into a 4D tensor. The function paths_to_tensor then stacks these tensors into a single 4D tensor covering all images in the training, validation, or test set, depending on which image paths are supplied.
import numpy as np
from keras.preprocessing import image
from tqdm import tqdm

def path_to_tensor(img_path):
    # loads RGB image as PIL.Image.Image type
    img = image.load_img(img_path, target_size=(224, 224))
    # convert PIL.Image.Image type to 3D tensor with shape (224, 224, 3)
    x = image.img_to_array(img)
    # convert 3D tensor to 4D tensor with shape (1, 224, 224, 3) and return it
    return np.expand_dims(x, axis=0)

def paths_to_tensor(img_paths):
    # stack the individual 4D tensors into one (n_samples, 224, 224, 3) tensor
    list_of_tensors = [path_to_tensor(img_path) for img_path in tqdm(img_paths)]
    return np.vstack(list_of_tensors)
Getting the 4D tensor ready for ResNet-50, or for any other pre-trained model in Keras, requires some additional processing. The built-in Keras function preprocess_input takes the output of the steps above, reorders the colour channels from RGB to the BGR order the pre-trained model expects, and finally zero-centres every pixel using the channel means of the ImageNet training set.
The data is now ready for predictions. After completing the preprocessing steps above, the prediction helper uses the predict function to obtain an array of probabilities over ImageNet's 1,000 classes. Taking the argmax of this predicted probability vector gives an integer corresponding to the model's predicted object class, which can be mapped to an object category through the use of this dictionary.
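Putting the pieces together, the prediction helper looks roughly like this; a sketch following the standard Keras ResNet-50 usage (note that this ResNet50_model, the full ImageNet classifier, is distinct from the Resnet50_model transfer model built later):

import numpy as np
from keras.applications.resnet50 import ResNet50, preprocess_input

# pre-trained ResNet-50 with its full ImageNet classification head
ResNet50_model = ResNet50(weights='imagenet')

def ResNet50_predict_labels(img_path):
    # preprocess the 4D tensor and return the index of the most likely ImageNet class
    img = preprocess_input(path_to_tensor(img_path))
    return np.argmax(ResNet50_model.predict(img))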
Creating a Dog Detector
Looking at the dictionary, you will notice that the categories corresponding to dogs appear in an uninterrupted sequence, at dictionary keys 151–268 (inclusive). Thus, to check whether the pre-trained ResNet-50 model predicts that an image contains a dog, we only need to check whether the prediction function above returns a value between 151 and 268 (inclusive).
We use these ideas to complete the dog detector function below, which returns True if a dog is detected in an image (and False if not).
### returns "True" if a dog is detected in the image stored at img_path
def dog_detector(img_path):
    prediction = ResNet50_predict_labels(img_path)
    return 151 <= prediction <= 268
The performance of this function was tested by answering two questions:
- What percentage of the images in the human images dataset have a detected dog?
- What percentage of the images in the dog images dataset have a detected dog?
This function is highly accurate: it detected a dog in 0% of the human images and in 100% of the dog images.
Implementation
Keras helps simplify building a CNN from scratch. Below is my approach to building a multilayer CNN:
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D
from keras.layers import Dropout, Flatten, Dense
from keras.models import Sequential

model = Sequential()
model.add(Conv2D(6, kernel_size=4, activation='relu', input_shape=(224, 224, 3), name='conv2d_1'))
model.add(MaxPooling2D(pool_size=4, strides=None, name='max_pooling2d_1'))
model.add(Conv2D(12, kernel_size=4, activation='relu', name='conv2d_2'))
model.add(MaxPooling2D(pool_size=4, strides=None, name='max_pooling2d_2'))
model.add(Conv2D(24, kernel_size=4, activation='relu', name='conv2d_3'))
model.add(MaxPooling2D(pool_size=4, strides=None, name='max_pooling2d_3'))
model.add(GlobalAveragePooling2D(name='global_average_pooling2d'))
model.add(Dense(133, activation='softmax'))

model.summary()
The selected CNN architecture alternates three convolutional layers with three max pooling layers. Placing a max pooling layer after each convolutional layer is a common ordering within a CNN, and the pattern may be repeated one or more times in a given model: the max pooling layers reduce the spatial dimensionality of the feature maps, which in turn reduces both overfitting and computational load. The penultimate layer is a Global Average Pooling (GAP) layer, included to collapse the spatial dimensions of the three-dimensional tensor. Finally, the last layer has 133 nodes, one per dog breed class, with a softmax activation to produce a probability for each class.
After compiling the above model, I fitted it using the training and validation datasets, then evaluated it on the provided test images. The goal was a test accuracy greater than 1%, and my initial model achieved 7.4163%, comfortably above the benchmark but nowhere near good enough for practical use!
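For reference, here is a sketch of the compile-and-fit step, assuming the tensor and one-hot target arrays prepared earlier (train_tensors, train_targets, and so on); the checkpoint path is illustrative:

from keras.callbacks import ModelCheckpoint

model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])

# keep only the weights that perform best on the validation set
checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.from_scratch.hdf5',
                               save_best_only=True, verbose=1)
model.fit(train_tensors, train_targets,
          validation_data=(valid_tensors, valid_targets),
          epochs=10, batch_size=20, callbacks=[checkpointer], verbose=1)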
This is where transfer learning came in handy to obtain a much better accuracy.
Refinement
It is possible to train a CNN using transfer learning to reduce training time without sacrificing much accuracy. So, I used a pre-trained VGG-16 model, provided by Udacity, as a starting point. After training on the training set, with the validation set used to monitor progress, this model correctly classified 44.7368% of the test set. That is a massive improvement, but it can be refined further. The code for my model build and its summary are below:
VGG16_model = Sequential()
VGG16_model.add(GlobalAveragePooling2D(input_shape=train_VGG16.shape[1:]))
VGG16_model.add(Dense(133, activation='softmax'))

VGG16_model.summary()
Next, I used a ResNet-50 model as the new starting point. Following the same process as with VGG-16, the ResNet-50 model correctly classified 80.8612% of the test set. This is a strong score, so I did not refine the model further. The code for my model build and its summary are below:
Resnet50_model = Sequential()
Resnet50_model.add(GlobalAveragePooling2D(input_shape=train_Resnet50.shape[1:]))
Resnet50_model.add(Dropout(0.2))
Resnet50_model.add(Dense(133, activation='softmax'))

Resnet50_model.summary()
Final Algorithm
The final step was to build an algorithm that takes an image path as input and determines whether a dog, a human, or neither is detected in the image:
- If a dog is detected in the image, the algorithm should predict the breed using the ResNet-50 transfer learning model.
- If a human is detected in the image, the algorithm should call that out and predict a resembling dog breed using the ResNet-50 transfer learning model.
- If neither are detected in the image, the algorithm should call that out.
The algorithm, showcased below, outputs the supplied image along with its corresponding prediction.
# Import relevant libraries
from IPython.display import Image, display

def breed_detector(img_path):
    '''
    INPUT:
    img_path -> path of the image to be classified

    OUTPUT:
    prints whether a dog, a human, or neither is detected in the image,
    along with the predicted (or resembling) dog breed where applicable
    '''
    display(Image(img_path, width=300))
    if dog_detector(img_path):
        print('A dog was detected in this photo! The predicted breed is a', Resnet50_predict_breed(img_path))
    elif face_detector(img_path):
        print('A human was detected in this photo! If this human were a dog, the resembling breed would be a', Resnet50_predict_breed(img_path))
    else:
        print('Neither a dog nor a human was detected in this photo!')
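The Resnet50_predict_breed helper referenced above is not shown earlier; here is a sketch of how it works, assuming the extract_bottleneck_features helper module from the project repository and a dog_names list of the 133 breed labels:

import numpy as np
from extract_bottleneck_features import extract_Resnet50  # assumed helper module in the repository

def Resnet50_predict_breed(img_path):
    # extract ResNet-50 bottleneck features for the supplied image
    bottleneck_feature = extract_Resnet50(path_to_tensor(img_path))
    # classify the features with the transfer-learning model built earlier
    predicted_vector = Resnet50_model.predict(bottleneck_feature)
    # return the breed label with the highest predicted probability
    return dog_names[np.argmax(predicted_vector)]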
Results
To test out my algorithm, I thought it would be fun to use images of my friend’s pets and images of me and my friend as human examples!
Meet Nacho in Figure 8, he’s an adorable brown Border Terrier. But did the algorithm predict that?
Yes, success!
However, the same cannot be said for Kika. Figure 9 shows a picture of Kika the Ibizan Hound, but the algorithm predicted that she was a Canaan Dog instead.
I was also interested in testing my algorithm with mixed-breed dogs, to find out whether it detects a dog at all and, if so, whether it predicts a breed close or related to the dog's mix.
Meet Sheeba in Figure 10, a mixed-breed dog known to have Terrier in her genes. The algorithm successfully classified Sheeba as a dog, and even predicted a Terrier breed!
The algorithm also successfully detected a human face; apparently, I resemble an English Toy Spaniel!
Finally, the algorithm provided the correct output when neither a dog nor human image was supplied, but instead an image of Bella the horse!
Model Evaluation & Validation
The following is the accuracy each model achieved on the test set:
- CNN from scratch: 7.4%
- CNN from VGG-16: 44.7%
- CNN from ResNet-50: 80.9%
These differences in accuracy reflect the varying number of trainable parameters in each model, on top of the millions of pre-trained parameters the two transfer models inherit from their ImageNet feature extractors:
- CNN from scratch: 9415 parameters
- CNN from VGG-16: 68229 parameters
- CNN from ResNet-50: 272517 parameters
Since each model's accuracy was measured on a held-out test set, overfitting is not a concern even though the final model has the largest number of parameters.
Reflection
The objective of this project was to create a CNN with more than 60% test accuracy while training on a GPU.
In my opinion, this project has been a great starting point for understanding and practising deep learning. Even so, exploratory data analysis and data visualisation were still key to creating and training an effective machine learning model, as these steps help in choosing a suitable performance metric for evaluation.
Before building the CNN model, the image data needed to be in the form of a 4D tensor when using Keras. Furthermore, all images needed to be reshaped into the same shape for training the CNN models in batch.
Building CNN models from scratch is made extremely simple by Keras. However, training them is computationally expensive and time-consuming, especially without a GPU. This was one of the reasons I instead used transfer learning with one of the many pre-trained models available in Keras, trained on the ImageNet dataset. Doing so massively increased the test accuracy, first with the VGG-16 network and subsequently with ResNet-50.
A very interesting takeaway is how powerful transfer learning is at achieving good results with minimal computation. It works best when the task at hand is similar to the task on which the pre-trained model's weights were optimised.
The final model comfortably exceeded the 60% target. Although its test accuracy is only 80.9%, the outcome of testing the algorithm is positive! Images of dogs are correctly identified, most dogs of known breeds have their breed predicted correctly, and the model distinguishes between human and dog images. When neither type of image is supplied, the algorithm also produces the correct output: 'neither dog nor human'.
Suggested Improvements
Three ways to improve the algorithm are:
- Use a model better suited to face detection, such as OpenCV's DNN module or VGG Face.
- Increase the training data, possibly including mixed-breed classifications, to improve the model's accuracy; this could also enable multi-label classification and make the model more flexible.
- Conduct a grid search over several neural network architectures and hyperparameters to find the optimal model for face detection as well.