Last time, we explored how a simple MLP neural network could be used to classify the MNIST dataset. Today, we will work on a messier problem. We will use a modified version of the Stanford dogs dataset to train a neural network that can classify dog breeds. Since inter-class variations are small, and an obscure detail could be the deciding factor, we will need a model that can capture more detail. This is where convolutional neural networks (CNNs) come in.
As always, we will start by explaining some of the high-level concepts. You can follow along with the code here.
Where the MLP Network Falls Short
Recall that before feeding MNIST images into our MLP, we converted the two-dimensional matrix of pixels into a one-dimensional vector. In doing so, we lost all the spatial information in the image. It’s like trying to play tic-tac-toe on a single row of boxes. This worked out with the MNIST dataset because all the images were already cleaned and standardized, but if the digits had varied in size or position, the model would have failed.
Another issue with MLPs is that they use fully-connected layers (Dense in Keras) which means there will be a lot of parameters to tune. With our simple 28×28 input, we already had upwards of 600k parameters. Computational limits can quickly become a problem.
CNNs solve both these problems. CNNs accept two-dimensional matrices as inputs and retain spatial information. CNNs also break the image down into patches so every pixel doesn’t have to be connected to every single hidden node. A pattern that is found in one corner can just be transferred to another corner without being relearned. For example, if the CNN can learn to find a cat in the top right corner of an image, it doesn’t have to learn how to find a cat in the bottom left corner too.
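To make the parameter savings concrete, here is a quick back-of-the-envelope comparison. The 512-node hidden layer is just an illustrative assumption, not the exact architecture from last time:

```python
# Fully connected: every one of the 28x28 input pixels connects to
# every node in a hypothetical 512-node hidden layer.
dense_params = 28 * 28 * 512 + 512   # weights + biases
print(dense_params)  # 401920

# Convolutional: 16 filters of size 3x3, each shared across the
# whole image no matter how large the image is.
conv_params = 16 * (3 * 3 + 1)       # 9 weights + 1 bias per filter
print(conv_params)   # 160
```

The convolutional layer's cost depends only on the filter size and count, not the image size, which is exactly why the pattern learned in one corner transfers to every other corner for free.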
Filters and Convolutions
To understand the magic of CNNs, first we need to understand the concept of filters and convolutions. Say we wanted to find the edges of an object in a picture represented by a matrix of pixels. A simple solution would be to see if the pixels on one side are different from the pixels on the other side. If there is a line of white pixels next to a line of black pixels, it might be an edge.
Mathematically, we can run a filter, called a kernel, over the picture. For vertical edges it looks something like this:
[[-1  0  1]
 [-1  0  1]
 [-1  0  1]]
To apply this kernel to a patch of pixels, we multiply the pixels by the corresponding value in the filter and then sum everything up. The resulting value is called a convolution. For example, if the values were the same on the left and the right, the resulting sum would be 0, meaning no edge.
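A minimal NumPy sketch of that computation, with made-up patch values for illustration:

```python
import numpy as np

# The vertical-edge kernel from above
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

def convolve_patch(patch, kernel):
    """Multiply each pixel by the matching kernel value and sum."""
    return np.sum(patch * kernel)

flat_patch = np.full((3, 3), 5)          # uniform region: no edge
edge_patch = np.array([[0, 0, 255],      # black pixels next to white
                       [0, 0, 255],      # pixels: a vertical edge
                       [0, 0, 255]])

print(convolve_patch(flat_patch, kernel))  # 0
print(convolve_patch(edge_patch, kernel))  # 765
```

The uniform patch cancels to zero because the left and right columns are identical, while the black-to-white patch produces a large value.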
By sliding our kernel across an image, we get multiple convolutions that can be combined into a new image. From there, the process can be repeated to find other patterns. To see some examples of different kernels check out the Wikipedia page or see them in action at this blog post.
Where the Magic Happens
A CNN is just like other neural networks, but instead of learning the weights of a linear equation, we are learning the weights of a kernel/filter. Say we wanted to detect human faces in an image. A CNN would slide a kernel across the image and then make a guess as to whether there was a human face. If it got it wrong, it would modify the kernel and try again. By stacking convolutions on top of each other, our CNN can learn the tiny details that make a human face a human face. For example, the first layer could find edges, the second shapes, and the third could find eyes and mouths.
Let’s talk about some of the terminology we are going to need to build our CNN.
CNNs are built with hidden layers called Convolutional Layers. A convolutional layer does a few things:
- Breaks the image up into smaller regions (the convolutional window).
- Slides the window across the image (the stride).
- Runs each window through filters and connects the results to the hidden layer (the activation).
A bigger window will find bigger patterns. A bigger stride will reduce the size of the convolution. If the window extends past the edge of the image, we can use padding to fill in the missing pixels with zeros. Adding more filters lets the layer learn more patterns. The most common activation function is ReLU, which returns zero if the value is negative and returns the value unchanged if positive.
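The window, stride, and activation can be sketched in a few lines of NumPy. This is a toy "valid" convolution with no padding; real layers are heavily optimized:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image with the given stride,
    applying ReLU to each convolution."""
    k = kernel.shape[0]
    out_h = (image.shape[0] - k) // stride + 1
    out_w = (image.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.maximum(np.sum(patch * kernel), 0)  # ReLU
    return out

image = np.random.rand(28, 28)
kernel = np.random.randn(3, 3)
print(convolve2d(image, kernel, stride=1).shape)  # (26, 26)
print(convolve2d(image, kernel, stride=2).shape)  # (13, 13)
```

Note how doubling the stride roughly halves the output in each dimension, which is what "a bigger stride will reduce the size of the convolution" means in practice.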
Along with convolutional layers, there are pooling layers. A pooling layer takes a convolutional layer as input and, for each feature map, takes either the average or the max of each region. This keeps the same number of filters but greatly reduces their dimensionality, which helps prevent overfitting.
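A small sketch of max pooling over 2×2 blocks:

```python
import numpy as np

def max_pool(feature_map, pool_size=2):
    """Keep the maximum value of each pool_size x pool_size block."""
    h, w = feature_map.shape
    out = feature_map[:h - h % pool_size, :w - w % pool_size]
    out = out.reshape(h // pool_size, pool_size, w // pool_size, pool_size)
    return out.max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [5, 0, 0, 2],
                 [1, 1, 3, 4]])
print(max_pool(fmap))
# [[4 2]
#  [5 4]]
```

The 4×4 map shrinks to 2×2, but the strongest response in each region survives.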
We can think of the number of filters as depth. As we move through a CNN, our inputs start large and shallow and, after repeated pooling, end up small and deep. Normally, a pooling layer is used after every one or two convolutional layers.
Image Augmentation and Transfer Learning
Before building our CNN to predict dog breeds, we need to quickly talk about some of the tools we will be using.
Image augmentation is a technique for adding noise to our training data. We can rotate, zoom, shear, and shift our existing images to create new training examples. This helps reduce overfitting.
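Keras provides an `ImageDataGenerator` class for this, but the idea itself is simple. Here is a toy NumPy version, where the transforms are simplified stand-ins for the real random rotations and shears:

```python
import numpy as np

def augment(image):
    """Create a few modified copies of one training image."""
    flipped = image[:, ::-1]              # mirror left-to-right
    shifted = np.roll(image, 2, axis=0)   # slide down by two pixels
    rotated = np.rot90(image)             # rotate 90 degrees
    return [flipped, shifted, rotated]

image = np.arange(16).reshape(4, 4)
extras = augment(image)
print(len(extras))  # 3 new training examples from 1 original
```

Each variant still shows the same dog, so the label is free, but the model can no longer memorize exact pixel positions.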
Transfer learning takes a pre-trained neural network and adapts it to new objectives and data. CNNs are powerful but can take a long time to train, so it makes sense to start with the work that has already been done by others.
Building a CNN From Scratch
As noted earlier, the code can be found here. After importing and processing our dataset we can finally build our CNN architecture:
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D
from keras.layers import Dropout, Flatten, Dense
from keras.models import Sequential

# Initialize model
model = Sequential()

# Convolutional layers
model.add(Conv2D(filters=16, kernel_size=3, padding='same',
                 activation='relu', input_shape=(224, 224, 3)))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=64, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=128, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.5))
model.add(GlobalAveragePooling2D())

# Fully connected layers
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(133, activation='softmax'))
Training the CNN on augmented data gets us an accuracy of about 10%. Not bad for our first try, but now let’s try using transfer learning.
Building a CNN with Transfer Learning
Sure, we could train our CNN above for a couple of weeks and see how we do, but let’s try building on the work of others instead.
Transfer learning works because networks trained on large datasets have often already learned features that are useful for other computer vision problems. For example, a network that can tell dogs from cats has probably already learned features that could be used to tell wolves from cats.
One way to accomplish this transfer is to run the borrowed model, minus whatever classification layer it uses, on our data to get an output. We can take this output (bottleneck) and feed it as input to our customized classification network.
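As a toy illustration of the bottleneck idea, where the "pretrained" extractor below is a hypothetical frozen function standing in for something like ResNet-50 with its classification layer removed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the borrowed network's frozen weights; in practice this
# would be a real pretrained model, not a random matrix.
frozen_weights = rng.standard_normal((784, 64))

def bottleneck_features(images):
    """Run images through the frozen extractor once; the outputs are
    then reused as inputs to our small trainable classifier."""
    return np.maximum(images @ frozen_weights, 0)

images = rng.random((10, 784))        # ten flattened toy "images"
bottleneck = bottleneck_features(images)
print(bottleneck.shape)  # (10, 64)
```

Because the extractor never changes, the bottleneck features only need to be computed once per image, and training then touches only the small classifier on top. That is what makes this approach so much faster than training from scratch.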
We will be using the pretrained ResNet-50 model. ResNet-50 was developed by a Microsoft team and won the ImageNet competition in 2015. It solves the vanishing gradient problem that plagued earlier very deep networks by introducing shortcut connections that link a layer directly to layers further along in the network, which is what makes its 50-layer depth trainable. We will add our own dog breed classification network to the end of it and see how it does.
# Define our architecture.
model_Resnet50 = Sequential()
model_Resnet50.add(Flatten(input_shape=train_Resnet50.shape[1:]))
model_Resnet50.add(Dense(512, activation='relu'))
model_Resnet50.add(Dropout(0.5))
model_Resnet50.add(Dense(133, activation='softmax'))
model_Resnet50.summary()
Training this model on our data gets us an accuracy of about 85% in a few minutes of training. Quite an improvement.
Fine-Tuning the Transfer Learning Model
We can take it another step by actually modifying the weights of our borrowed network. We accomplish this by loading the borrowed network as a base model, adding our classifier on top, and then freezing all the layers we don’t want to touch. A very small learning rate should be used so that large updates don’t destroy the pretrained weights:
from keras.applications.resnet50 import ResNet50
from keras.models import Model
from keras.layers import Input
from keras.optimizers import SGD

# Build the base ResNet-50 network
base_model = ResNet50(weights='imagenet', include_top=False,
                      input_tensor=Input(shape=(224, 224, 3)))

# Build our classifier model to put on top
top_model = Sequential()
top_model.add(Flatten(input_shape=base_model.output_shape[1:]))
top_model.add(Dense(512, activation='relu'))
top_model.add(Dropout(0.5))
top_model.add(Dense(133, activation='softmax'))

# Get weights from previously trained model
#top_model.load_weights('saved_models/weights.best.Resnet50.hdf5')

# Add top model to the ResNet-50 base
tuned_model = Model(inputs=base_model.input,
                    outputs=top_model(base_model.output))

# Freeze every layer except the last 12
for layer in tuned_model.layers:
    layer.trainable = False
for layer in tuned_model.layers[-12:]:
    layer.trainable = True
    print(layer.name)

# Compile with a very small learning rate
tuned_model.compile(optimizer=SGD(lr=1e-4, momentum=0.9),
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])
Training the model for a short period of time on augmented data gets us an accuracy of 77%. Computational limits played a part here since training 50 layers eats up a lot of memory. It is possible that we could do better by unfreezing more layers and training for longer.
Dog Breed Classifier
Now we put it all together and write an algorithm that accepts a file path to an image and then:
- If a dog is detected, return the predicted breed.
- If a human is detected, return the resembling dog breed.
- If neither is detected, indicate an error.
import cv2
import matplotlib.pyplot as plt

def breed_detector(img_path):
    """
    Takes in a path to an image and returns whether it is a dog
    or a human and what breed it looks like.
    """
    # Find breed with helper function
    predicted_breed = predict_breed_R50(img_path)

    # Show the image
    img = cv2.imread(img_path)
    cv_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    plt.title("Let's see what we have here!")
    plt.imshow(cv_rgb)
    plt.show()

    # Check if dog or human with OpenCV's detector.
    if dog_detector(img_path):
        print("That's a dog. It looks like a " + str(predicted_breed))
    elif face_detector(img_path):
        print("That's a human, but it looks like a " + str(predicted_breed))
    else:
        print("I'm not sure what that is. Are you trying to trick me?")
Running it on some sample dogs:
Running it on some sample humans: