Foundations of Convolutional Neural Networks
Last updated
Computer Vision is the application of Machine Learning techniques to images.
Deep Learning has several applications in Computer Vision: Image Classification, Object Detection, Neural Style Transfer (e.g. Prisma), etc.
If we use large images for training a deep neural network, we would have very large-dimensional feature vectors and even larger dimensional weight vectors. To prevent this, we use a mathematical operation known as convolution.
The convolution operation is denoted by an asterisk (*). It slides a small matrix called the filter (or kernel) over the image and performs an element-wise multiplication at each position.
The above image illustrates the convolution operation between an image and the sliding filter (in yellow), which results in the convolved feature matrix.
(Every element in the convolved matrix is obtained by summing the element-wise products between the filter and the image region it currently covers, as shown above.)
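This multiply-and-sum can be sketched in a few lines of NumPy (a minimal illustration; the function name `convolve2d` is ours, and deep-learning libraries use heavily optimized versions of this):

```python
import numpy as np

def convolve2d(image, filt):
    """'Valid' convolution: slide filt over image, summing element-wise products.
    (Strictly this is cross-correlation, which is what DL frameworks call convolution.)"""
    n, f = image.shape[0], filt.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(n - f + 1):
        for j in range(n - f + 1):
            # sum of element-wise products over the fxf region the filter covers
            out[i, j] = np.sum(image[i:i+f, j:j+f] * filt)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
filt = np.ones((3, 3))
print(convolve2d(image, filt).shape)  # (3, 3), i.e. (n-f+1)x(n-f+1)
```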
This operation is commonly used for edge detection.
Some commonly used filters for edge detection are:
However, we can also learn the filter.
If we convolve an nxn image with an fxf filter, we get an output image with the dimensions (n-f+1)x(n-f+1). Therefore, the size of the image becomes smaller after every convolution.
Also, the pixels at the corners are used only once, whereas pixels at other locations are used in multiple convolutions as the filter slides over the image. This leads to a loss of information at the corners and edges of the image.
To combat these two issues, we pad the input image with more pixels before applying the convolution operation. This would preserve the size of the image and prevent loss of valuable information.
We denote the padding size as p. If p=1, it means that we add a border of 1-pixel width to the image.
The resulting dimensions of the convolved image would be (n+2p-f+1)x(n+2p-f+1).
Note that a valid convolution uses no padding (p=0), while a same convolution uses p=(f-1)/2 so that the convolved image has the same size as the original input image. (f is usually odd.)
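The effect of padding on the output size can be checked numerically (a small sketch; the helper name `output_size` is ours):

```python
def output_size(n, f, p):
    """Output side length after convolving an nxn image with an fxf filter and padding p."""
    return n + 2 * p - f + 1

n, f = 6, 3
print(output_size(n, f, p=0))             # valid convolution: 4 (image shrinks)
print(output_size(n, f, p=(f - 1) // 2))  # same convolution: 6 (size preserved)
```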
Instead of moving the filter by 1 step at a time, we can move it by s steps. This is called a strided convolution, and s is the stride.
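With stride s and padding p, the output side length becomes ⌊(n+2p-f)/s⌋ + 1 (the floor handles the case where the filter cannot take a full final step). A quick sketch, with a helper name of our own:

```python
from math import floor

def strided_output_size(n, f, p, s):
    """Output side length for an nxn image, fxf filter, padding p, stride s."""
    return floor((n + 2 * p - f) / s) + 1

print(strided_output_size(n=7, f=3, p=0, s=2))  # floor(4/2) + 1 = 3
print(strided_output_size(n=7, f=3, p=0, s=1))  # stride 1 recovers n-f+1 = 5
```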
Since RGB images have 3 channels, image dimensions are nxnx3. So, we must use an fxfx3 (3-layer) filter. The resulting convolved matrix will be (n-f+1)x(n-f+1) and each of the elements in the resulting matrix is the sum of the element-wise products of the corresponding filter elements and the image elements over the 3 channels.
Note that the filter's 3rd dimension must match the number of channels in the image.
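A sketch of convolution over channels (our own loop-based illustration): the products are summed over height, width, and all 3 channels, so the output of one filter is still a 2-D matrix.

```python
import numpy as np

def convolve3d(image, filt):
    """Convolve an nxnxc image with an fxfxc filter; output is (n-f+1)x(n-f+1)."""
    assert image.shape[2] == filt.shape[2], "filter depth must match image channels"
    n, f = image.shape[0], filt.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # sum element-wise products over height, width AND channels
            out[i, j] = np.sum(image[i:i+f, j:j+f, :] * filt)
    return out

rgb = np.ones((6, 6, 3))   # toy 6x6 RGB image
filt = np.ones((3, 3, 3))  # 3x3x3 filter
print(convolve3d(rgb, filt).shape)   # (4, 4)
print(convolve3d(rgb, filt)[0, 0])   # 27.0, i.e. 3*3*3 products of ones
```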
A CNN is a deep neural network that has one or more convolutional layers and each convolutional layer can have different types and numbers of filters.
Most CNNs also have pooling layers and fully-connected layers.
CNNs automatically extract and learn features from images, starting from low-level features like edges and proceeding toward higher-level features such as shapes and object parts.
Pooling layers are used for dimensionality reduction. The following is an example of the max-pooling operation:
We basically take the max value of the image region covered by the filter.
Note that the pooling layer doesn't have any learnable parameters.
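Max pooling can be sketched as follows (a toy implementation with a hypothetical `max_pool` helper; note there is nothing to learn, only a max over each window):

```python
import numpy as np

def max_pool(image, f=2, s=2):
    """Max pooling with an fxf window and stride s; no learnable parameters."""
    n = image.shape[0]
    m = (n - f) // s + 1
    out = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            out[i, j] = np.max(image[i*s:i*s+f, j*s:j*s+f])
    return out

x = np.array([[1., 3., 2., 1.],
              [4., 6., 5., 2.],
              [7., 8., 9., 4.],
              [3., 1., 2., 6.]])
print(max_pool(x))  # [[6., 5.], [8., 9.]]
```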
A fully-connected layer is a layer in which every neuron is connected to each neuron in the previous layer. It is basically used to give names to the learned patterns, i.e. to associate them with class labels.
Parameter Sharing
It allows a feature detector to be used in multiple locations throughout the whole input image/input volume.
Sparsity of Connections
In each layer, each output value depends only on a small number of inputs (the region covered by the filter), rather than on every input pixel.
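A rough parameter count shows why these two properties matter (toy layer sizes of our own choosing):

```python
# Toy example: 32x32x3 input mapped to a 28x28x6 output.
# Fully-connected layer: every input connects to every output.
fc_params = (32 * 32 * 3) * (28 * 28 * 6)

# Conv layer producing the same output: six 5x5x3 filters, plus one bias each.
# The same small filters are shared across all positions in the image.
conv_params = 6 * (5 * 5 * 3 + 1)

print(fc_params)    # 14450688
print(conv_params)  # 456
```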
The Sobel filter (for vertical edges):
[[1, 0, -1], [2, 0, -2], [1, 0, -1]]

The Scharr filter:
[[3, 0, -3], [10, 0, -10], [3, 0, -3]]

For a strided convolution with padding p and stride s, the resulting dimensions would be ⌊(n+2p-f)/s + 1⌋ x ⌊(n+2p-f)/s + 1⌋.
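As a quick check, applying the Sobel filter to a toy image with a vertical edge produces large responses exactly at the edge (a sketch using a plain double loop for the valid convolution):

```python
import numpy as np

sobel = np.array([[1., 0., -1.],
                  [2., 0., -2.],
                  [1., 0., -1.]])

# 6x6 image: bright left half, dark right half -> a vertical edge in the middle
image = np.hstack([np.full((6, 3), 10.0), np.zeros((6, 3))])

n, f = image.shape[0], sobel.shape[0]
out = np.zeros((n - f + 1, n - f + 1))
for i in range(n - f + 1):
    for j in range(n - f + 1):
        out[i, j] = np.sum(image[i:i+f, j:j+f] * sobel)

# Responses of 40 in the two columns straddling the edge, 0 in flat regions
print(out)
```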