Sliding Window Detection
Once a CNN has been trained on cropped images that contain just the target object with minimal background (say, a car), it can detect that object in a larger test image with background by sliding a window over the image and classifying each crop: each window is passed as an input to the CNN, which labels it as car / not car.
This is, however, computationally inefficient.
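As a concrete illustration, here is a minimal sketch of the naive procedure in PyTorch (the framework is an arbitrary choice). `classifier` stands for a hypothetical CNN trained on 14x14 crops; the window size and stride match the example used further below.

```python
import torch

def naive_sliding_window(classifier, image, window=14, stride=2):
    """Run the classifier on every window crop; returns a grid of predicted labels."""
    _, H, W = image.shape
    predictions = []
    for top in range(0, H - window + 1, stride):
        row = []
        for left in range(0, W - window + 1, stride):
            crop = image[:, top:top + window, left:left + window]
            # Each crop is classified independently -- overlapping crops
            # repeat nearly identical computation, which is the source
            # of the inefficiency mentioned above.
            logits = classifier(crop.unsqueeze(0))   # shape (1, num_classes)
            row.append(logits.argmax(dim=1).item())
        predictions.append(row)
    return predictions                               # 8x8 grid for a 28x28 input
```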
A more efficient convolutional implementation of sliding windows is discussed below.
Consider the CNN shown below:
We can convert the FC layers (and the softmax layer) to convolutional layers as follows:

Say we have a 28x28 test image. With the traditional sliding-window approach, a 14x14 window and a stride of 2, we would pass the window (28 - 14)/2 + 1 = 8 times from left to right for each row (and, likewise, 8 times from top to bottom), i.e. 64 separate forward passes through the CNN.
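The original figure is not reproduced here, but the FC-to-convolution conversion can be sketched as follows, assuming the architecture commonly used for this example (14x14x3 input, a 5x5 convolution with 16 filters, 2x2 max pooling, two FC layers of 400 units, and a 4-way softmax); the filter counts and layer sizes are assumptions, and non-linearities are omitted for brevity.

```python
import torch.nn as nn

# Original classifier: only accepts 14x14 crops.
fc_version = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),    # 14x14x3 -> 10x10x16
    nn.MaxPool2d(2),                    # -> 5x5x16
    nn.Flatten(),
    nn.Linear(5 * 5 * 16, 400),         # FC layer, 400 units
    nn.Linear(400, 400),                # FC layer, 400 units
    nn.Linear(400, 4),                  # 4-way softmax (logits)
)

# Convolutional version: each FC layer becomes a convolution whose kernel
# covers its entire input volume, so the trained weights can be copied over.
conv_version = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 400, kernel_size=5),  # FC 400 -> 5x5 conv, output 1x1x400
    nn.Conv2d(400, 400, kernel_size=1), # FC 400 -> 1x1 conv
    nn.Conv2d(400, 4, kernel_size=1),   # softmax layer -> 1x1 conv, output 1x1x4
)
```

Because each FC layer's weight matrix can simply be reshaped into the corresponding convolution kernel, the converted network computes exactly the same function on a 14x14 input and requires no retraining.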
If we use the following convolutional approach instead, the same result is obtained in a single forward pass, and the prediction computed for each window position in the traditional approach appears at the corresponding position of the network's output volume, i.e. the 8 labels obtained from the first row of window positions match the first row of the 8x8x4 output of the CNN, and so on for the remaining rows.
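Continuing the sketch above, feeding the full 28x28 image through the converted network (`conv_version`, defined earlier, with its assumed layer sizes) produces the entire 8x8x4 output volume in one forward pass; PyTorch reports the shape channels-first.

```python
import torch

full_image = torch.randn(1, 3, 28, 28)   # batch of one 28x28 RGB image
output = conv_version(full_image)
print(output.shape)                      # torch.Size([1, 4, 8, 8]), i.e. 8x8x4
# Each of the 8x8 spatial positions holds the 4 class scores that the
# traditional approach would have computed for the corresponding 14x14
# window with stride 2, but the overlapping computation is shared.
```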
This approach is significantly more efficient: the computation over regions shared by overlapping windows is performed only once instead of being repeated for every crop.