Problem : How do I identify where my cat is in the image, i.e. which coordinates in the image?
Solution : Simple: go from image-to-class to image-to-box with two heads.
1. First head: a classification head that classifies the object.
2. Second head: a regression head that outputs the 4 coordinates of the box where the object is.
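The two-head idea above can be sketched in a few lines. This is a minimal illustration, not an actual network: the feature dimension, class count, random weights, and the (x, y, w, h) box encoding are all assumptions; in a real model the shared features come from a trained conv backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # numerically stable softmax over a score vector
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

feat_dim, n_classes = 8, 3
features = rng.normal(size=feat_dim)        # stand-in for shared backbone features

# Head 1: classification head, feat_dim -> n_classes probabilities
W_cls = rng.normal(size=(n_classes, feat_dim))
class_probs = softmax(W_cls @ features)

# Head 2: regression head, feat_dim -> 4 box numbers (x, y, w, h assumed)
W_box = rng.normal(size=(4, feat_dim))
box = W_box @ features

print(class_probs.shape, box.shape)
```

Both heads read the same shared features; only the final linear layers differ, which is why adding localization to a classifier is cheap.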
Problem : How does it work?
Solution : The OverFeat architecture. Slide windows across the whole image, i.e. one window in the upper-left corner, another in the upper-right corner, and so on, running classification and box regression on each window.
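To see why "and so on and on" gets expensive, here is a toy enumeration of sliding-window locations. The window sizes and stride are illustrative assumptions, not OverFeat's actual settings:

```python
import numpy as np

def sliding_windows(h, w, win, stride):
    """Yield (top, left) origins of win x win windows inside an h x w image."""
    for top in range(0, h - win + 1, stride):
        for left in range(0, w - win + 1, stride):
            yield top, left

h, w = 64, 64
# two window scales (32 and 48 px), stride 16 -- made-up numbers
locations = [(s, t, l)
             for s in (32, 48)
             for t, l in sliding_windows(h, w, s, 16)]
print(len(locations))  # 13 windows even for this tiny 64x64 image
```

On a real image with dense strides and many scales the count explodes into the thousands, which is the problem the next step addresses.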
Problem : Too many sliding-window locations to evaluate, and too many image scales to consider. It is pretty expensive to run the network on every one of the crops: 1 image yields n crops to evaluate. How do we make this efficient?
Solution : Efficient sliding window, i.e. instead of using FC layers for the classifier, we use convolutions for both classification and regression.
Use 1-by-1 convolutions, i.e. the FC layers become conv layers: 4096 feature channels (patterns like vertical lines, horizontal lines, slanted lines, etc.) extracted at each spatial position, which are then combined into 1024 channels that form higher-level features (circle, spiral, oval, a circle with a vertical line in it, etc.).
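The trick above can be verified numerically: a fully connected layer applied at every spatial position of a feature map is exactly a 1x1 convolution, so one convolutional pass replaces many per-crop FC evaluations. The 4096 -> 1024 channel sizes follow the text; the 3x3 spatial extent is an assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
C_in, C_out, H, W = 4096, 1024, 3, 3
feat_map = rng.normal(size=(C_in, H, W))   # conv feature map over a larger input
W_fc = rng.normal(size=(C_out, C_in))      # the former FC layer's weights

# 1x1 convolution: the same matrix applied at every spatial position at once
out = np.einsum('oc,chw->ohw', W_fc, feat_map)

# Naive sliding window: run the FC layer separately on each position
naive = np.stack([[W_fc @ feat_map[:, i, j] for j in range(W)]
                  for i in range(H)], axis=0)   # (H, W, C_out)
naive = naive.transpose(2, 0, 1)

print(np.allclose(out, naive))  # True: same numbers, one pass instead of nine
```

This is why the convolutional form is efficient: overlapping windows share all of their computation instead of redoing it per crop.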
Problem : OK, now it can correctly tell which object is in the image and where it is. But what if my image has multiple objects?
Solution : This is where object detection comes in.
Problem : So what if my image has multiple objects, sometimes 2, sometimes 4?
Solution : In that case classification is still our tool: since we can classify an image, we take patches of the image and run the classifier across all of them.
Problem : But how do I know what window size to consider? Do I need to try all window sizes, positions, and scales? That is tremendously expensive.
Solution : Brute force: build a fast classifier and try them all, e.g. HOG (Histogram of Oriented Gradients) features at multiple scales with a linear classifier on top.
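The core of HOG is cheap enough to sketch directly: per cell, histogram the gradient orientations weighted by gradient magnitude. Real HOG adds a cell grid, block normalization, and multiple scales; the 9-bin unsigned-orientation setup here is a common choice, not a full implementation:

```python
import numpy as np

def orientation_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of gradient orientations for one cell."""
    gy, gx = np.gradient(cell.astype(float))       # vertical, horizontal gradients
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation in [0, 180)
    bins = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), mag.ravel())     # accumulate magnitude per bin
    return hist

# A vertical edge: gradients point horizontally, so the 0-degree bin dominates.
cell = np.zeros((8, 8))
cell[:, 4:] = 1.0
h = orientation_histogram(cell)
print(h.argmax())  # 0
```

Descriptors like this are fast to compute, which is what made the "fast classifier over all windows" brute force feasible in the pre-CNN era.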
Problem : Can we avoid the brute-force try-everything approach?
Solution : Cut the search space down from everything to just the relevant regions. What if we only processed blob-like structures and excluded everything else? E.g. in an image of a cat and a dog sitting on green grass, the cat appears as one blob, the dog as another, the flowers as another. This process is known as "Region Proposals": the proposals are not very accurate but they are fast, and they put boxes around the blobby regions. The most famous method is Selective Search.
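A toy stand-in for the boxes-around-blobs idea: find the bounding box of each connected foreground region in a binary mask. Real methods like Selective Search merge regions by color and texture similarity; this 4-connected flood fill is only meant to show how blobs turn into candidate boxes:

```python
import numpy as np
from collections import deque

def blob_boxes(mask):
    """Return (top, left, bottom, right) boxes, one per connected blob."""
    seen = np.zeros_like(mask, dtype=bool)
    boxes = []
    h, w = mask.shape
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                # flood-fill this blob, tracking its bounding box
                q = deque([(i, j)])
                seen[i, j] = True
                t, l, b, r = i, j, i, j
                while q:
                    y, x = q.popleft()
                    t, l = min(t, y), min(l, x)
                    b, r = max(b, y), max(r, x)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                boxes.append((t, l, b, r))
    return boxes

mask = np.zeros((6, 8), dtype=bool)
mask[1:3, 1:3] = True          # "cat" blob
mask[4:6, 5:8] = True          # "dog" blob
print(blob_boxes(mask))        # [(1, 1, 2, 2), (4, 5, 5, 7)]
```

Each box is then a candidate region to classify, which is a far smaller set than every possible window.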
Region Proposals + CNN (R-CNN):
Problem : R-CNN is inefficient; it has a slow, complex pipeline (a full CNN forward pass per proposal).
Solution : Fast R-CNN. Instead of cropping each region of interest out of the image itself, what if we extract the "region blobs", i.e. regions of interest, from the conv feature maps? Then the expensive convolutions run once per image and are shared across all regions.
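Extracting a region from the feature map is the RoI pooling step: cut the region out of the shared conv feature map and max-pool it to a fixed grid so the downstream heads always see the same shape. The sizes here are illustrative, and a real implementation also carries an image-to-feature-map scale factor:

```python
import numpy as np

def roi_pool(feat, box, out_size=2):
    """Max-pool feat[:, top:bottom, left:right] to (C, out_size, out_size)."""
    c = feat.shape[0]
    top, left, bottom, right = box
    region = feat[:, top:bottom, left:right]
    h, w = region.shape[1:]
    # split the region into an out_size x out_size grid of bins
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((c, out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[:, i, j] = region[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max(axis=(1, 2))
    return out

feat = np.arange(64, dtype=float).reshape(1, 8, 8)   # one-channel feature map
pooled = roi_pool(feat, (2, 2, 6, 6))                # a 4x4 RoI -> fixed 2x2 output
print(pooled.shape)  # (1, 2, 2)
```

Because every RoI pools from the same feature map, proposals of any size become fixed-size inputs for the shared classification and regression heads.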
Problem : Now Fast R-CNN is so fast that the actual speed bottleneck is the region proposal step itself, i.e. finding the regions of interest: the overall speedup drops from 125x to 25x because of it.
Solution : What if we could make the CNN do the region proposals too? Sharing conv computation is what made Fast R-CNN faster than R-CNN, so applying the same idea to proposals gives Faster R-CNN. The Region Proposal Network (RPN) is a CNN layer that looks at the last conv layer's feature map and produces the regions of interest from it.
Problem : Great, but how does the Region Proposal Network work? Is it real or a theoretical thing?
Solution : It is real: a sliding window over the conv feature map, with a classification head that says object-or-not and a regression head that regresses the box coordinates. In practice they use anchor boxes at each position: classification says whether each anchor contains an object, and regression gives offsets from the anchor box, e.g. the object is 5 px to the left of the anchor, etc.
With Faster R-CNN you get the speedup because the same conv layers are shared by the region proposals, region pooling, classification, and regression; sharing the conv features is what makes it faster.
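Anchor offsets can be made concrete with the decoding step: the regression head does not output box corners directly, it outputs offsets (tx, ty, tw, th) relative to a fixed anchor. This (center-shift, log-size) parameterization is the one used in the R-CNN line of work; the specific numbers below are made up:

```python
import numpy as np

def decode(anchor, offsets):
    """anchor = (cx, cy, w, h); offsets = (tx, ty, tw, th) -> predicted box."""
    cx, cy, w, h = anchor
    tx, ty, tw, th = offsets
    return (cx + w * tx,          # shift the center by a fraction of anchor size
            cy + h * ty,
            w * np.exp(tw),       # scale width and height multiplicatively
            h * np.exp(th))

anchor = (50.0, 50.0, 32.0, 32.0)
print(decode(anchor, (0.0, 0.0, 0.0, 0.0)))         # zero offsets -> the anchor itself
print(decode(anchor, (0.25, 0.0, np.log(2), 0.0)))  # shift center right 8 px, double width
```

Regressing small offsets from a fixed anchor is an easier target than regressing raw image coordinates, which is why the anchor scheme is used in practice.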
Problem : So what is the best object detection network as of 2016?
Answer : It's ResNet-101 + Faster R-CNN + some extras.
- Faster R-CNN : Caffe + Python implementation available
- YOLO : very fast but slightly reduced accuracy; code is available and useful if we don't have large GPUs
- ResNet weights are available on GitHub