Neural networks are often referred to as black boxes because it is hard to understand what each neuron does individually and how they interact with one another.
Here we are going to open this black box and give some visual explanations of how a convolutional neural network is able to "see" things.
To do so, we will work with perhaps the most popular convolutional neural network (CNN), the VGG16. Despite its age (it was first introduced in 20141), it is still used in many cases and applications and keeps producing amazing results compared to newer architectures.
The VGG16 (Visual Geometry Group) is composed of 16 weight layers: 13 convolutional and 3 dense.2
We modify the model to produce 5 outputs (1 per convolutional block).
Since each neuron performs its transformation by sliding its kernel over the image, these new outputs will let us see our image after those transformations, at several stages of the network.
We will use two VGG16 models: one pretrained on "imagenet"3, the other untrained, i.e. with random normalized weights.
We will visually compare the transformations performed by the two models and see how training affects them.
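As a sketch of how this setup could be built (assuming Keras, whose VGG16 implementation names the pooling layer at the end of each block `block1_pool` through `block5_pool`), the two 5-output models might look like this:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

# One output per convolutional block, taken after each block's pooling layer.
block_ends = ["block1_pool", "block2_pool", "block3_pool",
              "block4_pool", "block5_pool"]

def multi_output_vgg16(weights):
    base = VGG16(weights=weights, include_top=False)
    outputs = [base.get_layer(name).output for name in block_ends]
    return Model(inputs=base.input, outputs=outputs)

pretrained = multi_output_vgg16("imagenet")  # weights learned on ImageNet
untrained = multi_output_vgg16(None)         # randomly initialized weights
```

Calling either model on a preprocessed image then returns a list of 5 arrays, one per convolutional block.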
For each image, we will display 12 output images per convolutional block (the feature maps are selected by sorting them by the sum of their values). Since each output image has shape [X, X, 1], each plot is a single-channel image rendered with matplotlib's default colormap, 'viridis', which goes from dark blue to yellow.
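A minimal sketch of this selection and display step, following the sorting rule described above (numpy and matplotlib assumed; `show_top_maps` is a hypothetical helper name):

```python
import numpy as np
import matplotlib.pyplot as plt

def show_top_maps(block_output, n=12):
    """block_output: array of shape (1, X, X, channels) for a single image."""
    maps = block_output[0]                # (X, X, channels)
    sums = maps.sum(axis=(0, 1))          # one sum per feature map
    top = np.argsort(sums)[::-1][:n]      # the n maps with the largest sums
    fig, axes = plt.subplots(3, 4, figsize=(12, 9))
    for ax, idx in zip(axes.ravel(), top):
        ax.imshow(maps[:, :, idx])        # 2D array -> default 'viridis' colormap
        ax.axis("off")
    plt.show()
```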
Original image4
Output from the 1st convolutional block.
Output from the 2nd convolutional block.
Output from the 3rd convolutional block.
Output from the 4th convolutional block.
Output from the 5th convolutional block.
Output from the 1st convolutional block.
Output from the 2nd convolutional block.
Output from the 3rd convolutional block.
Output from the 4th convolutional block.
Output from the 5th convolutional block.
Original image5
Output from the 1st convolutional block.
Output from the 2nd convolutional block.
Output from the 3rd convolutional block.
Output from the 4th convolutional block.
Output from the 5th convolutional block.
Output from the 1st convolutional block.
Output from the 2nd convolutional block.
Output from the 3rd convolutional block.
Output from the 4th convolutional block.
Output from the 5th convolutional block.
The images below are the aggregation of all the images displayed by each convolutional block.
For example, for block 5, the 512 feature maps have been aggregated together and then normalized to [0, 255] for display.
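A minimal sketch of this aggregation step (numpy assumed; summing over the channel axis is one plausible way to aggregate the maps, not necessarily the exact method used here):

```python
import numpy as np

def aggregate_block(block_output):
    """block_output: array of shape (1, X, X, channels), e.g. 512 channels for block 5."""
    agg = block_output[0].sum(axis=-1)                        # (X, X): sum over all feature maps
    agg = (agg - agg.min()) / (agg.max() - agg.min() + 1e-8)  # rescale to [0, 1]
    return (agg * 255).astype(np.uint8)                       # normalize to [0, 255] for display
```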
Note the difference between the pretrained and not trained VGG16:
On the first 3 blocks, the untrained model displays images that are less modified, closer to the original one.
This is because the untrained kernels are not able to identify edges and shapes. The model therefore struggles to sort the information and prioritize the parts of the image that will help recognize it.
On the last 2 blocks, this lack of prioritization leads to blurry results: the model hasn't identified the important zones and looks everywhere.
For the pretrained model, on the contrary, the first 2 blocks quickly identify the shapes and edges of the image, which helps the last 3 blocks focus on the most relevant areas.
For the dog image, we can clearly see a triangle where the head is located.
For the boat, it is mostly focused on the front and on the contact between the water and the hull.
This allows us to conclude that a CNN trained on "imagenet" learns to identify patterns rather than specific images.
This is why we can use these pretrained models (transfer learning) and still get great results on images they were not trained on.
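As an illustration of how such a pretrained model can be reused, here is a Keras-style sketch of transfer learning: the convolutional blocks are frozen and only a small new head is trained (`num_classes` and the head layers are placeholders for a hypothetical target task, not part of the original setup):

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

num_classes = 10  # hypothetical number of classes in the new task

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the ImageNet-learned pattern detectors as they are

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```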
In a way, the model learns to see!