Problem : I do not have a big data set. How do I squeeze every bit of use out of the data I have ?
Answer : We can transform the data. That way we artificially expand the dataset through different transformations, e.g.
1. Horizontal flips
2. Random crops / scales
3. Color jitter – brightness, contrast, PCA-driven color augmentation
4. Translation, rotation, stretch, shearing, lens distortion, etc.
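The first three augmentations above can be sketched with plain NumPy (a minimal illustration on a fake image; the function names are my own, and a real pipeline would typically use a library such as torchvision):

```python
import numpy as np

rng = np.random.default_rng(0)

def horizontal_flip(img):
    # Mirror the image left-to-right (img is an H x W x C array).
    return img[:, ::-1, :]

def random_crop(img, crop_h, crop_w):
    # Take a random crop_h x crop_w window out of the image.
    h, w, _ = img.shape
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w, :]

def color_jitter(img, max_delta=0.2):
    # Scale brightness by a random factor in [1 - max_delta, 1 + max_delta].
    factor = 1.0 + rng.uniform(-max_delta, max_delta)
    return np.clip(img * factor, 0.0, 1.0)

img = rng.random((32, 32, 3))          # a fake 32 x 32 RGB image in [0, 1]
augmented = [horizontal_flip(img), random_crop(img, 28, 28), color_jitter(img)]
```

Applying several of these per image at training time multiplies the effective dataset size for free.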

Data Augmentation

Problem : How do I prevent overfitting ?
Answer : Use Dropout, DropConnect, or Batch Normalization.
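As a concrete example, dropout can be sketched in a few lines of NumPy (the inverted-dropout variant; `p_keep` and the function name are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout_forward(x, p_keep=0.5, train=True):
    # Inverted dropout: randomly zero activations during training and
    # rescale the survivors by 1/p_keep so the expected activation is
    # unchanged. At test time the layer is a no-op.
    if not train:
        return x
    mask = (rng.random(x.shape) < p_keep) / p_keep
    return x * mask

x = np.ones((4, 8))
train_out = dropout_forward(x, p_keep=0.5, train=True)   # values are 0 or 2
test_out = dropout_forward(x, train=False)               # unchanged
```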

Problem : Is there a more general overfitting prevention strategy ?
Answer : The common theme behind these techniques is to add noise during training and then marginalise over that noise at test time.

Problem : Do I need a lot of data ?
Answer : Well, no — we can use transfer learning. We can download a pre-trained model from the internet, freeze the pre-trained layers, use them as a feature extractor, and add a layer on top that we train ourselves. This is very common and provides a strong baseline.
If we have a medium-sized dataset, we can freeze only part of the pre-trained layers, add layers on top, and retrain those with our data.
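A toy sketch of the freeze-and-train-a-head idea in NumPy: a frozen random projection stands in for the pre-trained layers, and only a logistic-regression head is trained on top (all names and data here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Pretend these are frozen pre-trained weights: they are never updated.
W_frozen = rng.standard_normal((64, 16))

def extract_features(x):
    # Frozen feature extractor (stands in for the pre-trained conv layers).
    return np.maximum(x @ W_frozen, 0.0)          # ReLU features

# Tiny synthetic dataset: labels follow a linear rule on the frozen
# features, so a linear head on top of them is enough to fit.
x = rng.standard_normal((200, 64))
feats = extract_features(x)
feats = (feats - feats.mean(0)) / feats.std(0)    # standardise for stable GD
y = (feats @ rng.standard_normal(16) > 0).astype(float)

w = np.zeros(16)                                  # the only trainable weights
for _ in range(2000):                             # plain logistic regression
    p = 1.0 / (1.0 + np.exp(-(feats @ w)))
    w -= 0.1 * feats.T @ (p - y) / len(y)         # head update; W_frozen untouched

accuracy = np.mean(((feats @ w) > 0) == (y == 1))
```

Replacing `W_frozen` with the activations of a real pre-trained ConvNet gives exactly the "feature extractor plus new top layer" recipe described above.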

CNNs transfer learning.jpg

Problem : What if my input data is very different from the data the pre-trained ConvNet was trained on ?
Answer : We should take data similarity into account. If we have very little data and it is similar to the pre-training data, we can train a linear classifier on the top-layer features.
If our data is similar and we have a lot of it, we can fine-tune a few layers with the richer dataset to adapt the model to our needs.
If our data is very different and we have very little of it, things are a bit more complicated: instead of extracting features from the last layer, we can try extracting features from earlier layers, which sometimes helps. The intuition is that for, say, MRI data there are no wheels, rims, or beaks of the kind the top layers of an ImageNet model respond to. However, lower-level features such as horizontal and vertical edges transfer much better from the object-recognition domain to the MRI domain. We can then feed these lower-level features into newly trained layers that learn MRI-specific high-level features on top of them.
If we have a very different dataset and a lot of data, we can fine-tune a large number of layers, i.e. keep the weights of the lower layers as they are and fine-tune the higher layers to capture different high-level knowledge.

Fine Tune Transfer ConvNets.png

Question : Is transfer learning with CNNs a hack or a standard norm ?
Answer : Transfer learning with CNNs is the standard norm these days, not the exception. For example, object detection and image captioning systems were both built on top of an ImageNet model downloaded from the internet.
We would not want to train a model from scratch unless we have a very, very large dataset; fine-tuning almost always works well.

Problem : Where can I download the models from ?
Solution : Collections of models can be found in the "Model Zoo" of the Caffe ConvNet library. We can also use the Caffe models with Torch or other libraries.

Problem : If we stack two 3 x 3 conv layers, how big a region of the input does a neuron of the second layer see ?
Answer : 5 x 5 — each additional stride-1 3 x 3 layer grows the receptive field by 2.

Question : Similarly, if we stack three 3 x 3 conv layers, how big of an input region does a neuron in the third layer see ?
Answer : 7 x 7, i.e. three 3 x 3 conv layers give a similar representational power as a single 7 x 7 convolution.
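The receptive-field arithmetic behind both answers can be sketched in a few lines (stride-1 layers only; the helper name is my own):

```python
def receptive_field(kernel_sizes):
    # For stride-1 convolutions stacked on top of each other, each extra
    # layer with kernel size k grows the receptive field by (k - 1).
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

two_3x3 = receptive_field([3, 3])        # 5: two stacked 3 x 3 layers
three_3x3 = receptive_field([3, 3, 3])   # 7: three stacked 3 x 3 layers
one_7x7 = receptive_field([7])           # 7: a single 7 x 7 layer
```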

Question : If so, which one should we use then ?
Answer : Let us compare with an example. Suppose we have an input of H x W x C, with C filters to preserve depth, stride 1, and padding to preserve H and W. Then
1 CONV layer with 7 x 7 filters :: number of weights = C x ( 7 x 7 x C ) = 49 C^2
3 CONV layers with 3 x 3 filters :: number of weights = 3 x C x ( 3 x 3 x C ) = 27 C^2
A stack of 3 x 3 convolutions is hence preferable to one 7 x 7.
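The two counts above, checked in code (C = 64 is an arbitrary example depth; biases are ignored, as in the formulas):

```python
def conv_params(kernel, c_in, c_out):
    # Weights in one conv layer: c_out filters, each kernel x kernel x c_in.
    return c_out * kernel * kernel * c_in

C = 64
one_7x7 = conv_params(7, C, C)          # 49 * C^2 weights
three_3x3 = 3 * conv_params(3, C, C)    # 27 * C^2 weights
```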

Filters more or less.png

Problem : So if small filters are better, why not use 1 x 1 instead of 3 x 3 ?
Answer : With 1 x 1 convolutions stacked across multiple layers, higher layers never get to look at a bigger region of the input — every layer sees the same region no matter how deep we go. So we use a slightly more elaborate design to get the advantage of both worlds, i.e.
1. A bottleneck 1 x 1 conv to reduce the depth
2. A 3 x 3 conv at the reduced depth
3. A 1 x 1 conv to restore the depth.
This idea is also used in GoogLeNet and ResNet.
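A quick sketch of why the bottleneck design is cheap, comparing parameter counts at an example depth of C = 256 with a 4x depth reduction (both numbers are illustrative choices):

```python
def conv_params(kernel, c_in, c_out):
    # Weights in one conv layer: c_out filters, each kernel x kernel x c_in.
    return c_out * kernel * kernel * c_in

C = 256
plain = conv_params(3, C, C)                    # one 3 x 3 conv at full depth
bottleneck = (conv_params(1, C, C // 4)         # 1 x 1 reduce to C/4
              + conv_params(3, C // 4, C // 4)  # 3 x 3 at reduced depth
              + conv_params(1, C // 4, C))      # 1 x 1 restore to C
```

The three-layer bottleneck stack ends up far cheaper than the single full-depth 3 x 3 conv, while also inserting extra non-linearities.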

1 x 1 CONV.png

Key Takeaways : 

  1. Replace large convolutions with stacks of 3 x 3 convolutions
  2. 1 x 1 bottleneck convolutions are very efficient
  3. An N x N convolution can be factored into 1 x N and N x 1 convolutions. Not widely used yet, but maybe in the future
  4. All of the above give fewer parameters, less compute, and more non-linearity

Question :  How can I speed up the ConvNet?
Answer : We can speed up ConvNets by multiplying in the Fourier domain instead of convolving directly. While this works great for deeper ConvNets, using it in shallow networks can be detrimental, in that the overhead of preparing the Fourier transforms outweighs the benefit of the fast multiplication.
It has also been observed that FFTs do not work well with small filters such as 3 x 3. And because of the benefits of small filters, FFT-based convolution, although a speed-up in principle, unfortunately does not provide much benefit in practice.

Question : Well, if FFT does not go well with small filters, are there other ways to improve speed ?
Answer : Since a ConvNet is mostly matrix multiplication, if we can somehow speed up matrix multiplication, we can speed up the ConvNet. There is an algorithm called Strassen's algorithm that reduces the complexity from O(N^3) to about O(N^2.81). It is efficient for matrices larger than about 1000 on a side, and even for sizes in the thousands the benefit is at best around 10%; for smaller matrices, naive matrix multiplication is better. Strassen, to add more, is also vector-processor friendly, in that we can run it on GPUs.
With VGG, the reported speed-ups range from 6x to 2x for small to large input sizes.
We can also try the Coppersmith–Winograd algorithm, which is asymptotically even better at O(N^2.376). However, as many have pointed out, it is not practical, and it does not map well to vector processors.
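A minimal recursive sketch of Strassen's algorithm (assuming square matrices whose size is a power of two; the `leaf` cutoff reflects the point below which naive multiplication wins):

```python
import numpy as np

def strassen(A, B, leaf=64):
    # Strassen's recursion: 7 half-size block multiplications instead of 8,
    # which is where the O(n^2.81) exponent comes from. Falls back to
    # ordinary multiplication below the leaf size.
    n = A.shape[0]
    if n <= leaf:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.default_rng(4).standard_normal((128, 128))
B = np.random.default_rng(5).standard_normal((128, 128))
```

The result matches `A @ B` up to floating-point error; the win over the naive method only materialises at much larger sizes than this toy example.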