Convolutional Networks (ConvNets) are a type of neural network inspired by the animal visual cortex. They are modeled on the biological observation that individual cortical neurons respond to stimuli only in a restricted region of space known as the receptive field.
They have been widely used in image and video recognition, recommender systems and natural language processing.
Below is an example of a ConvNet used for recognizing everyday objects, where each colored box marks a region in which the ConvNet has recognized an object.
Fig : Multiple object recognition in an image REF: (2)
From Brute Force technique to the Intelligent Convolution Network
Problem : How do we identify each object in the picture?
Sliding Window Solution : Approach to Object Recognition in Image
What if we could create a small window and slide it across the image? Then we could identify objects at any location in the picture. For example, we can identify the digit 8 at any position in the image with the sliding window. Let’s see it in action.
Fig : Object Recognition with sliding window in action Ref : (3)
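The sliding window idea can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the 8 x 8 toy image, the 3 x 3 pattern and the `looks_like_target` helper are hypothetical stand-ins (a real system would call a trained classifier on each window instead of comparing pixels).

```python
import numpy as np

def sliding_windows(image, window, step):
    """Yield (row, col, patch) for every window position that fits in the image."""
    height, width = image.shape
    for row in range(0, height - window + 1, step):
        for col in range(0, width - window + 1, step):
            yield row, col, image[row:row + window, col:col + window]

def looks_like_target(patch, template, threshold=0.9):
    """Hypothetical stand-in for a trained classifier: fraction of matching pixels."""
    return np.mean(patch == template) >= threshold

# Toy data (assumed): a blank 8x8 image with a 3x3 pattern pasted at row 4, column 2.
template = np.array([[1, 1, 1],
                     [1, 0, 1],
                     [1, 1, 1]])
image = np.zeros((8, 8), dtype=int)
image[4:7, 2:5] = template

# Check every window position and remember where the pattern was found.
hits = [(row, col)
        for row, col, patch in sliding_windows(image, window=3, step=1)
        if looks_like_target(patch, template)]
print(hits)
```

Because the window has a fixed size, the same loop would miss a pattern drawn at a different scale.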
Problem Again : What if the 8 is very small or very large, so that it does not fit our sliding window? Our sliding window will fail in that case.
Brute Force Solution : One brute force solution is to add examples of different sizes to our training data and then train our model. However, this is an inefficient approach, as there can be n different sizes of data.
More Problem : More data = harder for the neural network to solve = requires larger networks with more layers.
Intelligent Convolution Network solution :
- They can easily handle translation invariance, i.e. an 8 is an 8 no matter where it is located: top, bottom, middle, slightly left, slightly right or any other position.
- Instead of feeding the entire image to our neural network, we break the image into overlapping image tiles, i.e.
1 image INTO 77 tiny image tiles
- Feed each image tile into a neural network to see if it is an “8”, i.e.
Repeat this 77 times, once for each image tile
- Catch : Keep the weights the same for each tile.
- Save the results from each tile into a new array (Preserve tile arrangement)
- But this array is pretty big. To reduce its size, we down-sample it using an algorithm called max pooling, i.e. we look at each 2 x 2 grid square of the array and keep only its max value in the max-pooled array. (Main Idea : If we found something interesting in any of the four input tiles, we’ll only keep the most interesting bit. This reduces the size of our array while keeping the most important bits.)
- Final Step : Now that we have a fairly small array, feed it into another neural network that will decide whether or not the image is a match.
- To sum up, here’s the whole process in a simple diagram
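The steps above can be sketched with NumPy. This is a minimal sketch, not a trained model: the 18 x 26 image, the 6 x 6 tile size and the step of 2 are assumptions chosen only so that the tile count comes out to 77, and all weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_tiles(image, tile, step):
    """Cut the image into overlapping tiles, keeping their grid arrangement."""
    rows = (image.shape[0] - tile) // step + 1
    cols = (image.shape[1] - tile) // step + 1
    tiles = np.empty((rows, cols, tile, tile))
    for i in range(rows):
        for j in range(cols):
            tiles[i, j] = image[i * step:i * step + tile,
                                j * step:j * step + tile]
    return tiles

def score_tiles(tiles, weights, bias):
    """Apply the SAME weights to every tile (weight sharing): one score per tile."""
    return np.einsum("ijkl,kl->ij", tiles, weights) + bias

def max_pool(grid, size=2):
    """Keep the largest value in each size x size block (ragged edges are dropped)."""
    rows, cols = grid.shape[0] // size, grid.shape[1] // size
    trimmed = grid[:rows * size, :cols * size]
    return trimmed.reshape(rows, size, cols, size).max(axis=(1, 3))

def classify(features, dense_weights, dense_bias):
    """Final small network, reduced here to one logistic unit on the flat array."""
    logit = features.ravel() @ dense_weights + dense_bias
    return 1.0 / (1.0 + np.exp(-logit))

image = rng.random((18, 26))                      # assumed toy image
tiles = extract_tiles(image, tile=6, step=2)      # 7 x 11 grid = 77 tiles
scores = score_tiles(tiles, rng.standard_normal((6, 6)), 0.0)
pooled = max_pool(scores, size=2)                 # 3 x 5 after 2x2 pooling
match_probability = classify(pooled, rng.standard_normal(pooled.size), 0.0)
print(tiles.shape[:2], pooled.shape, match_probability)
```

In a real ConvNet the tile weights and the final classifier weights would be learned together from labeled examples rather than drawn at random.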
More Complex Convolution Networks : For real world problems, a single series of convolution, max pooling and finally a fully connected network might not be enough, and we often have to combine these steps and stack them as many times as needed. We can also add max pooling wherever it is needed to reduce the data size.
- More convolution steps generally mean that the network can learn to recognize more complicated features. For example, the first convolution layer might learn to recognize sharp edges, the second might recognize beaks from sharp edges, and the third might recognize entire birds. A more realistic deep convolutional network might look more like
- In the above case, they start with a 224 x 224 pixel image, apply convolution and max pooling, apply convolution and max pooling again, then apply convolution three times followed by max pooling, then two fully connected layers, and finally classify the image into one of 1000 categories.
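To see how such a stack shrinks the image, one can trace the spatial size layer by layer with the usual formula floor((n + 2p - k) / s) + 1. The layer order below follows the description above, but the kernel sizes, strides and paddings are not given in the text; the values used here are assumptions picked purely for illustration.

```python
def output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer on a square input."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding): hypothetical values, not taken from the text.
layers = [
    ("conv1", 11, 4, 2),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]

size = 224                 # input is 224 x 224 pixels
sizes = [size]
for name, kernel, stride, padding in layers:
    size = output_size(size, kernel, stride, padding)
    sizes.append(size)
    print(f"{name}: {size} x {size}")
```

With these assumed settings the 224 x 224 input is reduced to a 6 x 6 grid before the fully connected layers take over.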
So which one is the right network?
This can most probably be answered only with a lot of experimentation.
- Image representation: an image as a grid of pixel values
Fig . Image as Pixel values, in our digit recognition project case REF: (3)
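As a small illustration of this representation, a grayscale image can be held as a 2D array of integers. The 5 x 5 array below is a made-up example, not data from the digit recognition project.

```python
import numpy as np

# Hypothetical 5x5 grayscale image: 0 is black, 255 is white.
image = np.array([
    [0,   0, 255,   0, 0],
    [0, 255,   0, 255, 0],
    [0,   0, 255,   0, 0],
    [0, 255,   0, 255, 0],
    [0,   0, 255,   0, 0],
], dtype=np.uint8)

# Networks usually work with floats, so scale the pixels to the range [0, 1].
normalized = image.astype(np.float32) / 255.0

# A plain (non-convolutional) network would see the image as one flat vector.
flat = normalized.ravel()
print(image.shape, flat.shape)
```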
- Four core building blocks of every ConvNet:
- Convolution
- Non Linearity (ReLU)
- Pooling or Sub Sampling
- Classification (Fully Connected Layer)
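Two of these blocks are small enough to write out directly. The sketch below shows ReLU and, as an assumption about how the classification step produces its output, a softmax that turns raw class scores into probabilities; the input values are made up.

```python
import numpy as np

def relu(x):
    """Non-linearity: negative values become 0, positive values pass through."""
    return np.maximum(x, 0)

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

feature_map = np.array([[-2.0, 1.5],
                        [0.0, -0.5]])
activated = relu(feature_map)

class_scores = np.array([2.0, 1.0, 0.1])   # hypothetical scores for 3 classes
probabilities = softmax(class_scores)
print(activated)
print(probabilities)
```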