Convolutional Networks (ConvNets) are a type of neural network inspired by the animal visual cortex. They are modeled on the biological observation that individual cortical neurons respond to stimuli only in a restricted region of space known as the receptive field.

They have been widely used in image and video recognition, recommender systems and natural language processing.

Below is an example of a ConvNet used for recognizing everyday objects, where each colored box can be thought of as one stride (one window position) of the ConvNet.


Fig: Multiple object recognition in an image. Ref: (2)

From Brute Force technique to the Intelligent Convolution Network

Problem: How do we identify each object in the picture?

Sliding Window Solution: An Approach to Object Recognition in Images

What if we could create a small block (a window) and slide it across the image? Then we could identify objects at any location in the picture. For example, we can identify the digit 8 at any position in the image with the sliding window block. Let’s see it in action.


Fig: Object recognition with a sliding window in action. Ref: (3)
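The sliding window idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind the animation: the helper name `sliding_windows`, the toy 4x4 image and the "sum of pixels" scoring rule are all assumptions made for the example. In a real system the score would come from a trained classifier.

```python
def sliding_windows(image, window, stride=1):
    """Yield (row, col, patch) for every window position in a 2-D image.

    `image` is a list of equal-length rows, `window` is the side length of
    the square block, and `stride` is how far the block moves at each step.
    """
    height, width = len(image), len(image[0])
    for row in range(0, height - window + 1, stride):
        for col in range(0, width - window + 1, stride):
            patch = [r[col:col + window] for r in image[row:row + window]]
            yield row, col, patch


# Toy 4x4 image with a bright 2x2 blob in the bottom-right corner.
image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# Hypothetical "detector": the window whose pixels sum highest wins.
best = max(
    sliding_windows(image, window=2),
    key=lambda item: sum(map(sum, item[2])),
)
print(best[0], best[1])  # top-left corner of the best window
```

With a 2x2 window and a stride of 1, the 4x4 image yields 3 x 3 = 9 window positions, and the blob is found wherever it sits, as long as it is roughly the size of the window.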

Problem Again: What if the 8 is very small or very large, so that it does not fit in our sliding window? The sliding window will fail in that case.

Brute Force Solution: One brute-force solution is to add copies of the objects at different sizes to our training data and then train our model. However, this is an inefficient approach, as there can be n different sizes to cover.

More Problems: More data = a harder problem for the neural network to solve = a larger network with more layers is required.
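To make the brute-force idea concrete, here is a rough sketch of how differently sized copies of one training example could be generated. The text does not say how the extra data is produced, so the nearest-neighbour resizing, the helper name `resize_nearest` and the chosen sizes are all assumptions.

```python
def resize_nearest(image, new_height, new_width):
    """Resize a 2-D image (list of rows) with nearest-neighbour sampling."""
    height, width = len(image), len(image[0])
    return [
        [
            image[r * height // new_height][c * width // new_width]
            for c in range(new_width)
        ]
        for r in range(new_height)
    ]


# Toy 2x2 "digit"; the values are made up for illustration.
digit = [
    [0, 1],
    [1, 0],
]

# One extra training example per size we want the model to cope with.
augmented = [resize_nearest(digit, s, s) for s in (2, 4, 6)]
for example in augmented:
    print(len(example), "x", len(example[0]))
```

Every additional size multiplies the amount of training data, which is exactly why this approach scales poorly.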

Intelligent Convolution Network solution :

  • They handle translation invariance easily, i.e. an 8 is an 8 no matter where it is located: top, bottom, middle, slightly left, slightly right or any other position.
  • Instead of feeding the entire image to our neural network, we break the image into overlapping image tiles, i.e.


1 image INTO  77 tiny image tiles


  • Feed each image tile into a neural network to see if it is an “8” (Fig: Convolution network explained).
    Repeat this 77 times, once for each image tile.
  • Catch: Keep the weights the same for each tile.
  • Save the results from each tile into a new array, preserving the tile arrangement.
  • This array is still pretty big, so to reduce its size we down-sample it using an algorithm called max pooling: for each 2×2 grid square in our array, we copy only the max value into the max-pooled array. (Main idea: if we found something interesting in any of the four input tiles, we keep only the most interesting bit. This reduces the size of our array while keeping the most important bits.) (Fig: Convolution network max pooling)
  • Final step: Once we have a fairly small array, feed it into another neural network that decides whether or not the image is a match.
  • To sum up, here is the whole process in one simple diagram (Fig: Convolutional network in one image).
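The tiling, shared-weight and max pooling steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the "neural network" applied to each tile is reduced to a single weighted sum, the 2x2 kernel is hand-picked rather than learned, and the function names `convolve` and `max_pool` are chosen for this example only.

```python
def convolve(image, kernel, stride=1):
    """Score every overlapping tile of `image` with the same `kernel`.

    Reusing one kernel for all tiles is the "keep the weights the same"
    rule. The result keeps the spatial arrangement of the tiles.
    """
    k = len(kernel)
    scores = []
    for row in range(0, len(image) - k + 1, stride):
        score_row = []
        for col in range(0, len(image[0]) - k + 1, stride):
            score_row.append(sum(
                image[row + i][col + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            ))
        scores.append(score_row)
    return scores


def max_pool(grid, size=2):
    """Keep only the largest value in each non-overlapping size x size square."""
    return [
        [
            max(grid[row + i][col + j] for i in range(size) for j in range(size))
            for col in range(0, len(grid[0]) - size + 1, size)
        ]
        for row in range(0, len(grid) - size + 1, size)
    ]


# Toy 5x5 image with a 2x2 bright patch (values are made up).
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
kernel = [[1, 1], [1, 1]]  # assumed weights; a real ConvNet learns these

scores = convolve(image, kernel)  # 4x4 array, one score per tile
pooled = max_pool(scores)         # 2x2 array after down-sampling
print(scores)
print(pooled)
```

Here 16 tile scores shrink to 4 pooled values, and the strongest response (the tile that exactly covers the bright patch) survives the pooling. The small pooled array is what would be passed to the final network that makes the match decision.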

More Complex Convolution Networks: For real-world problems, a single series of convolution, max pooling and finally a fully connected network might not be enough, and we will often have to combine these steps and stack them as many times as needed. We can also add max pooling wherever it is needed to reduce the data size.

  • More convolution steps generally mean that the network can learn to recognize more complicated features. For example, the first convolution layer might learn to recognize sharp edges, the second might recognize beaks from sharp edges, and the third might learn to recognize an entire bird. A more realistic deep convolutional network might look like this (Fig: ConvNets in the real world).
  • In the above case, they start with a 224 × 224 pixel image, apply convolution and max pooling, apply convolution and max pooling again, then apply convolution three times followed by max pooling, and finish with two fully connected layers that classify the image into one of 1000 categories.
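To see how stacking shrinks the data, the spatial size can be traced through the layer sequence described above using the standard output-size formula, floor((n + 2p - k) / s) + 1. The text gives only the order of the layers, so every kernel size, stride and padding value below is an assumption chosen for illustration and does not reproduce the pictured network.

```python
def output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding): all hyperparameters here are assumed.
layers = [
    ("conv1", 3, 1, 1),
    ("pool1", 2, 2, 0),
    ("conv2", 3, 1, 1),
    ("pool2", 2, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool3", 2, 2, 0),
]

size = 224  # input image is 224 x 224 pixels, as in the text
trace = [("input", size)]
for name, kernel, stride, padding in layers:
    size = output_size(size, kernel, stride, padding)
    trace.append((name, size))

for name, side in trace:
    print(f"{name}: {side} x {side}")
```

Under these assumptions the padded convolutions keep the size unchanged and each pooling step halves it, so the fully connected layers receive a much smaller grid than the original image.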

So which one is the right network?

This can most probably be answered only with a lot of experimentation [3].




  1. Image representation as pixel values


Fig: Image as pixel values, in the case of our digit recognition project. Ref: (3)

  2. Four core building blocks of every ConvNet:
     1. Convolution
     2. Non-linearity (ReLU)
     3. Pooling or sub-sampling
     4. Classification (fully connected layer)
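As a rough sketch of the second and fourth building blocks, the snippet below applies ReLU to a small feature vector and then classifies it with one fully connected layer. The feature values, weights and biases are made up, and turning the scores into probabilities with softmax is an assumption about how the classification step is finished; the list above does not specify it.

```python
import math


def relu(values):
    """Non-linearity: replace every negative value with zero."""
    return [max(0.0, v) for v in values]


def fully_connected(inputs, weights, biases):
    """One output score per class: weighted sum of all inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


# Pretend these four numbers came out of convolution and pooling.
features = relu([-1.5, 2.0, 0.5, -0.2])

# Hypothetical weights for two classes, e.g. "is an 8" and "is not an 8".
weights = [
    [0.1, 1.0, 0.0, 0.3],
    [0.2, -1.0, 0.5, 0.1],
]
biases = [0.0, 0.0]

scores = fully_connected(features, weights, biases)
probabilities = softmax(scores)
print(features)
print(probabilities)
```

In a trained ConvNet the weights and biases are learned from data, and the class with the highest probability is reported as the prediction.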