Image Localisation

Problem : How do I identify where my cat is in the image, i.e. which coordinates in the image?
Solution : Simple. Go from "image to class" to "image to box" by giving the network two heads:
1. Classification head: predicts the class of the object.
2. Regression head: predicts the 4 coordinates of the box around the object.
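
To make the two-head idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy sizes (8 features, 3 classes), the random untrained weights, and the helper names `linear`, `softmax`, `make_layer` and `two_head_forward`. A real network would get `features` from a convnet and learn the weights.

```python
import math
import random

def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_layer(rng, n_out, n_in):
    """Random (untrained) weights and zero biases, for illustration only."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
NUM_FEATURES, NUM_CLASSES = 8, 3                        # toy sizes (assumed)
cls_w, cls_b = make_layer(rng, NUM_CLASSES, NUM_FEATURES)
box_w, box_b = make_layer(rng, 4, NUM_FEATURES)         # 4 box coordinates

def two_head_forward(features):
    """Feed the same shared features to both heads."""
    class_probs = softmax(linear(cls_w, cls_b, features))  # head 1: what is it?
    box = linear(box_w, box_b, features)                   # head 2: where is it?
    return class_probs, box

features = [rng.random() for _ in range(NUM_FEATURES)]  # stand-in for convnet output
probs, box = two_head_forward(features)
print(len(probs), len(box))
```

The point of the sketch is only the shape of the computation: one shared feature vector in, class probabilities and four box numbers out.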

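Problem : How does it work?
Solution : The OverFeat architecture. Slide a window across the whole image, i.e. one window on the upper left corner, another on the upper right corner, and so on. Each window gets a class score and a box from the two heads, and the results are combined into the final prediction.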
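
A small sketch of how the window positions can be enumerated. The image size, window size and stride below are assumed values chosen so that exactly four windows fit (one per corner); `sliding_windows` is a hypothetical helper, not part of any library.

```python
def sliding_windows(img_w, img_h, win, stride):
    """Yield (x1, y1, x2, y2) for each window, left to right, top to bottom."""
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            yield (x, y, x + win, y + win)

# Assumed sizes: a 257 x 257 image, a 221 x 221 window, stride 36.
windows = list(sliding_windows(img_w=257, img_h=257, win=221, stride=36))
for w in windows:
    print(w)  # upper left, upper right, lower left, lower right
```

With a smaller stride or several image scales the number of windows grows quickly, which is what the next problem is about.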

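Problem : There are too many sliding window locations and too many image scales to evaluate.
Solution : Running the network on every crop is expensive: 1 image means n crops to evaluate. How do we make this efficient? The answer is the efficient sliding window: instead of keeping the fully connected (FC) layers as the classifier, we convert them into convolutional layers, so the whole network is a convnet that does both classification and regression. It can then run once over the full image and share the computation between overlapping windows.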

Sliding Window ConvNet to Efficient Sliding Window ConvNet

The FC layers become 1 by 1 convolutions: the 4096-unit layer becomes a convolution with 4096 output channels, so 4096 features are extracted at each window position. A 1 by 1 convolution then combines these 4096 features into 1024 higher-level features, in the same way that lower layers combine simple features (vertical, horizontal and slanted lines) into more complex ones (circle, spiral, oval, circle with a vertical line in it, etc.).
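
A 1 by 1 convolution is just the same dense layer applied at every spatial position. The sketch below shows that with nested lists; the toy sizes (3 input channels, 2 output channels) stand in for 4096 and 1024, and `conv1x1` is a hypothetical helper written for this note.

```python
def conv1x1(feature_map, weights):
    """Apply the same dense layer at every spatial position.

    feature_map: H x W x C_in nested lists
    weights:     C_out x C_in nested lists
    returns:     H x W x C_out nested lists
    """
    return [
        [
            [sum(w * v for w, v in zip(row, pixel)) for row in weights]
            for pixel in line
        ]
        for line in feature_map
    ]

# Toy input: 2 x 2 positions, 3 channels each (assumed values).
feature_map = [
    [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
    [[2.0, 2.0, 2.0], [1.0, 0.0, 1.0]],
]
# Toy weights: 2 output channels, each a mix of the 3 input channels.
weights = [
    [1.0, 0.0, -1.0],
    [0.5, 0.5, 0.5],
]
out = conv1x1(feature_map, weights)
print(out[0][0])  # the 2 output features at the top-left position
```

Because the weights are shared across positions, a larger input simply produces a larger output grid, one prediction per window position, without re-running the network on each crop.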

Problem : OK, now it can correctly tell what the object is and where it is. But what if my image has multiple objects?
Solution : This is where object detection comes in.

object detection - ConvNet.JPG

Problem : What if the number of objects varies, sometimes 2, sometimes 4?
Solution : A regression head needs a fixed number of outputs, so in that case classification is our solution. We already know how to classify an image, so we take patches of the image and classify each one.

Problem : But how do I know what window size to use? Do I need to try all window sizes, positions and scales? That is tremendously expensive.
Solution : Brute force. Build a fast classifier and try them all, e.g. compute HOG (Histogram of Oriented Gradients) features at multiple scales and run a linear classifier on them.

Problem : Can we avoid the brute force, try-everything approach?
Solution : Cut down the search space from all windows to the relevant ones. Process only blob-like structures and exclude everything else, i.e. for an image of a cat and a dog sitting on green grass, the cat will appear as a blob, the dog as a blob, the flowers as a blob. Methods that do this are known as “Region Proposals”. They are not very accurate but they are fast: they put boxes around the blobby regions. The most famous one is Selective Search.

region proposal.jpg

Region Proposal – CNN ( R-CNN):

Problem : R-CNN is inefficient: slow, with a complex pipeline.
Solution : Fast R-CNN. Instead of extracting the regions of interest from the image itself, extract the “region blobs”, or regions of interest, from the CONV feature maps, so the convolutional layers run only once per image.

fast r-cnn.jpg
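
Taking a region from the feature map instead of the image can be sketched as follows: project the box onto the feature-map grid, then max-pool it into a fixed-size output so every region looks the same to the later layers. This is a simplified single-channel illustration, not the exact Fast R-CNN code; the stride of 16, the 2 x 2 output size and the helper names `ceil_div` and `roi_max_pool` are assumptions.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)

def roi_max_pool(feature_map, roi, out_size=2, stride=16):
    """Max-pool one region of a single-channel feature map to out_size x out_size.

    feature_map: H x W nested lists
    roi:         (x1, y1, x2, y2) in image pixels (integers)
    stride:      image pixels per feature-map cell (assumed)
    """
    fm_h, fm_w = len(feature_map), len(feature_map[0])
    # Project the box from image pixels onto feature-map cells, clamped to the map.
    x1 = min(max(roi[0] // stride, 0), fm_w - 1)
    y1 = min(max(roi[1] // stride, 0), fm_h - 1)
    x2 = min(max(ceil_div(roi[2], stride), x1 + 1), fm_w)
    y2 = min(max(ceil_div(roi[3], stride), y1 + 1), fm_h)
    rw, rh = x2 - x1, y2 - y1
    pooled = []
    for i in range(out_size):
        ys = y1 + (i * rh) // out_size
        ye = y1 + ceil_div((i + 1) * rh, out_size)
        row = []
        for j in range(out_size):
            xs = x1 + (j * rw) // out_size
            xe = x1 + ceil_div((j + 1) * rw, out_size)
            # Keep the strongest activation inside this bin.
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
big = roi_max_pool(feature_map, (0, 0, 64, 64))      # whole map
small = roi_max_pool(feature_map, (16, 16, 64, 48))  # a smaller region
print(big, small)  # both are 2 x 2 regardless of the region size
```

The convnet runs once to produce `feature_map`; each proposed region then costs only a cheap pooling step instead of a full forward pass.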

Problem : Fast R-CNN is now so fast that the actual speed bottleneck is the region proposal step, i.e. finding the regions of interest. The speedup drops from 125x to 25x once region proposal time is counted.
Solution : What if we make the CNN do the region proposals too? It will then be faster, since moving work into the CNN is what got us from R-CNN to Fast R-CNN. This is Faster R-CNN. The Region Proposal Network is a CNN layer that looks at the last CONV layer and creates the regions of interest from it.

Problem : Great, but how does the Region Proposal Network work? Is it real or just a theoretical thing?
Solution : Slide a window over the CONV feature map; at each position a classifier says object or not and a regressor predicts the box coordinates. In practice they use anchor boxes at each point: classification gives object or not for each anchor, and regression gives the offset from the anchor box, i.e. the object is 5 px to the left of the anchor box, etc.

Faster C-NN   Faster C-NN anchor box approach.jpg
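
The anchor box idea can be sketched in a few lines: place a fixed set of boxes at each feature-map point, then let the regression output nudge each one. The 3 scales and 3 aspect ratios and the offset parametrisation (centre shift relative to anchor size, log-scale width and height) follow the Faster R-CNN paper as I understand it; the helper names `make_anchors` and `decode`, and the example numbers, are made up for illustration.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors (cx, cy, w, h) centred on one point, one per scale/ratio pair."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(r)   # width/height ratio is r, area stays s * s
            h = s / math.sqrt(r)
            anchors.append((cx, cy, w, h))
    return anchors

def decode(anchor, offsets):
    """Apply predicted offsets (tx, ty, tw, th) to an anchor box."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = offsets
    return (
        xa + tx * wa,          # shift centre x by a fraction of the anchor width
        ya + ty * ha,          # shift centre y by a fraction of the anchor height
        wa * math.exp(tw),     # scale the width
        ha * math.exp(th),     # scale the height
    )

anchors = make_anchors(cx=100, cy=100)
print(len(anchors))  # one anchor per (scale, ratio) pair

# A 50 x 50 anchor with a small predicted shift: the centre moves
# right by a tenth of the width and up by a fifth of the height.
box = decode((100.0, 100.0, 50.0, 50.0), (0.1, -0.2, 0.0, 0.0))
print(box)
```

Zero offsets return the anchor unchanged, so the network only has to learn small corrections rather than absolute coordinates.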

With Faster R-CNN you get the speedup because the same convolutional layers are used for region proposal, classification and regression, i.e. the CONV layers are shared, hence it is faster.

Faster C-NN Results.jpg

Problem :
So what was the best object detection network as of 2016?
Answer :
It’s ResNet-101 + Faster R-CNN + some extras.

Code at

  1. Faster R-CNN : Caffe + Python implementation available
  2. YOLO : Very fast but slightly reduced accuracy. Code is available and useful if we don't have large GPUs
  3. ResNet : weights available on GitHub