Convolutional Neural Networks (CNNs) are widely used in deep learning, particularly for tasks involving image and video recognition. The convolutional layer is the fundamental building block of a CNN. Stacking multiple convolutional layers creates a feature hierarchy, allowing the network to learn increasingly complex and abstract representations of the input data.
Hierarchical Feature Extraction
- Early layers detect simple, low-level features such as edges, corners, and color blobs from the raw pixel data.
- Subsequent layers take the feature maps from the previous layers as input and learn to combine these simple features into more complex, mid-level features like textures, circles, or eyes.
- Deeper layers continue this process, forming high-level features such as specific object parts (e.g., a bicycle frame, a nose, a face) and eventually entire objects.
This hierarchical learning process, which is inspired by the human visual cortex, enables the network to effectively recognize complex patterns necessary for tasks like image classification, object detection, and facial recognition.
In the context of image processing, the input data is typically a multi-dimensional array representing the pixel values of an image. Convolutional layers apply a set of learnable filters, also known as kernels, to these input arrays. Each filter is a small matrix that slides, or convolves, across the input data. As it moves, it performs element-wise multiplications and sums the results to produce a single output value at each position. This process is repeated across the entire input, generating a feature map, which highlights specific patterns or features present in the input data.
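The sliding-window operation described above can be sketched in a few lines of NumPy. This is an illustrative toy, not a real CNN library call: the 3x3 kernel below is a hypothetical vertical-edge detector chosen to show how a filter produces a feature map.

```python
import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over every valid position of the image.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiply the window by the kernel, then sum.
            feature_map[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return feature_map

# A tiny image with a vertical edge between dark (0) and bright (10) columns.
image = np.array([
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
], dtype=float)

# Hypothetical vertical-edge kernel: responds where values change left to right.
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

print(convolve2d(image, kernel))  # 2x2 feature map with strong responses at the edge
```

In a real convolutional layer the kernel values are not hand-picked like this; they are learned during training.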
Convolutional layers also contribute to the network’s ability to learn complex patterns through the use of multiple filters. Each filter in a convolutional layer is capable of detecting different features, such as vertical or horizontal edges, textures, or color gradients. By stacking multiple convolutional layers in a network, CNNs can progressively learn more abstract and complex representations of the input data. This hierarchical feature extraction process enables CNNs to achieve high levels of accuracy in tasks like object detection, facial recognition, and image classification. Another significant advantage of convolutional layers is their parameter sharing and sparsity of connections, which lead to reduced computational complexity and memory usage.
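The savings from parameter sharing are easy to quantify with back-of-the-envelope arithmetic (the numbers below are illustrative, not taken from any particular library): a convolutional filter reuses the same small weight matrix at every spatial position, so its cost depends only on the kernel and channel counts, while a fully connected layer assigns a separate weight to every pixel.

```python
# Cost of a 3x3 convolution with 3 input channels and 32 filters:
# each filter has 3*3*3 weights plus one bias, shared across all positions.
conv = (3 * 3 * 3 + 1) * 32
print(conv)    # 896 parameters, whether the image is 100x100 or 1000x1000

# Cost of a dense layer mapping a flattened 100x100x3 image to just 32 units:
dense = (100 * 100 * 3) * 32 + 32
print(dense)   # 960,032 parameters, and the count grows with the image size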
Let us look at an example model and its summary.
Model
import tensorflow as tf

model = tf.keras.models.Sequential([
    # Note the input shape is the desired size of the image, with 3 color channels (RGB)
    # This is the first convolution
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(100, 100, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The second convolution
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The third convolution
    tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The fourth convolution
    tf.keras.layers.Conv2D(256, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # Flatten the results to feed into a DNN
    tf.keras.layers.Flatten(),
    # 512-neuron hidden layer
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(256, activation='relu'),
    # Only 1 output neuron. It will contain a value from 0 to 1, where 0 is for
    # one class ('horses') and 1 is for the other ('humans')
    tf.keras.layers.Dense(1, activation='sigmoid')
])
Model Summary
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ conv2d (Conv2D) │ (None, 98, 98, 32) │ 896 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d (MaxPooling2D) │ (None, 49, 49, 32) │ 0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_1 (Conv2D) │ (None, 47, 47, 64) │ 18,496 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d_1 (MaxPooling2D) │ (None, 23, 23, 64) │ 0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_2 (Conv2D) │ (None, 21, 21, 128) │ 73,856 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d_2 (MaxPooling2D) │ (None, 10, 10, 128) │ 0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_3 (Conv2D) │ (None, 8, 8, 256) │ 295,168 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d_3 (MaxPooling2D) │ (None, 4, 4, 256) │ 0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ flatten (Flatten) │ (None, 4096) │ 0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense (Dense) │ (None, 512) │ 2,097,664 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense) │ (None, 256) │ 131,328 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense) │ (None, 1) │ 257 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 2,617,665 (9.99 MB)
Trainable params: 2,617,665 (9.99 MB)
Non-trainable params: 0 (0.00 B)
Explanation
Here’s a breakdown of what you would typically see:
- Layer (type): This column lists each layer in your model and its type (e.g., Conv2D, MaxPooling2D, Dense, Flatten).
- Output Shape: This shows the shape of the tensor produced by each layer. For convolutional layers, it shows how the spatial dimensions (height, width) are reduced and how the number of filters increases. For dense layers, it shows the number of neurons.
- Param #: This is the most crucial part, indicating the number of trainable parameters in each layer. A parameter is a weight or bias that the model learns during training. The total number of parameters gives an idea of the model’s complexity.
- Conv2D layers: Parameters are calculated as (kernel_width * kernel_height * input_channels + 1) * output_channels. The +1 is for the bias term of each filter.
- MaxPooling2D layers: These layers have no trainable parameters, as they just perform downsampling based on a fixed operation.
- Flatten layer: This layer also has no trainable parameters, as it just reshapes the data.
- Dense layers: Parameters are calculated as (input_units * output_units) + output_units. The second output_units term is for the bias terms.
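As a sanity check, the two formulas above reproduce every Param # value in the summary using plain Python arithmetic (no TensorFlow required):

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    # (kernel_width * kernel_height * input_channels + 1) * output_channels
    return (kernel_h * kernel_w * in_channels + 1) * out_channels

def dense_params(in_units, out_units):
    # (input_units * output_units) + output_units
    return in_units * out_units + out_units

params = [
    conv_params(3, 3, 3, 32),        # conv2d:   896
    conv_params(3, 3, 32, 64),       # conv2d_1: 18,496
    conv_params(3, 3, 64, 128),      # conv2d_2: 73,856
    conv_params(3, 3, 128, 256),     # conv2d_3: 295,168
    dense_params(4 * 4 * 256, 512),  # dense:    2,097,664
    dense_params(512, 256),          # dense_1:  131,328
    dense_params(256, 1),            # dense_2:  257
]
print(sum(params))  # 2617665 — matches "Total params" in the summary
```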
At the bottom, you’ll find:
- Total params: The sum of all trainable parameters in the model.
- Trainable params: The number of parameters that will be updated during the training process.
- Non-trainable params: The number of parameters that will not be updated during training (e.g., if you freeze some layers).
This summary helps you understand the flow of data through your network, the size of intermediate representations, and the overall capacity of your model.
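That flow of data can also be traced by hand for the example model: each 3x3 convolution with default 'valid' padding trims 2 pixels from the height and width, and each 2x2 max pool halves both dimensions, rounding down. A short sketch of this bookkeeping (plain Python, no TensorFlow required):

```python
def conv3x3_valid(h, w):
    # A 3x3 kernel with 'valid' padding removes 2 pixels per spatial dimension.
    return h - 2, w - 2

def maxpool2x2(h, w):
    # A 2x2 pool with stride 2 halves each dimension, rounding down.
    return h // 2, w // 2

h, w = 100, 100
shapes = []
for _ in range(4):  # four Conv2D + MaxPooling2D pairs
    h, w = conv3x3_valid(h, w)
    shapes.append((h, w))
    h, w = maxpool2x2(h, w)
    shapes.append((h, w))

print(shapes)
# [(98, 98), (49, 49), (47, 47), (23, 23), (21, 21), (10, 10), (8, 8), (4, 4)]
print(4 * 4 * 256)  # 4096 units after Flatten, as in the summary
```

The printed shapes match the Output Shape column of the model summary layer by layer.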