Traditionally, we refer to using either max pooling, or 1 x 1 convnet or 3 x 3 or 5 x 5 convnet. All these gather different information. Instead of using these one at a time in a layer, what if we could use all of these in a single input layer, then we create more better models.
For example, in the image below, we use the following on a single Input layer
- Average pooling
- 1 x 1 Convolution Network
- 1 x 1 Convolution Network followed by 3 x 3 Convolution Network
- 1 x 1 Convolution Network followed by 5 x 5 Convolution Network
And on the top we concatenate all these four. In inception, we can choose the parameters in such a way that despite having all these numerous ConvNets, that the total number of hyper-parameters are simple, thus reducing the chance of over-fitting, while performing significantly better than the conventional convolution networks.
Fig Inception Module REF: Udacity