Some Practical Examples

  1. Image captioning : an image comes in and we generate a sequence of words.
  2. Sentiment classification : consume a sequence of words, then classify the sentiment of the sentence.


  • One to Many : e.g. take in an image and generate a sequence of words (a caption) describing it.
  • Many to One : e.g. take in a number of words and classify the sentiment of the sentence as positive or negative.
  • Many to Many : e.g. take in a number of words in English and translate them into a number of words in French.
  • Many to Many : e.g. take in a number of video frames and decide whether the video is child safe or not.


For example, in one house-number recognition case at Google, rather than taking the whole image at once and producing the house number, they took a sequential approach: feeding patches of the image from left to right and building up the house number along the way.

Rnn digit recognition sequential.jpg

Working : An RNN applies the same function with the same set of parameters at every time step, and this is what allows it to handle sequences of any length.
The RNN does not know anything in the beginning, but as training goes on it learns that an opening quote should have a closing quote, learns about spaces, and so on, with more and more iterations.
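
As a rough illustration of "same function, same parameters at every time step", here is a minimal NumPy sketch of a vanilla RNN. The names (rnn_step, Wxh, Whh, bh) and the sizes are assumptions chosen for the example, not something given in these notes.

```python
import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, bh):
    """One vanilla RNN step: mix the current input with the previous hidden state."""
    return np.tanh(Wxh @ x + Whh @ h_prev + bh)

def rnn_forward(xs, h0, Wxh, Whh, bh):
    """Apply the same step, with the same weights, to a sequence of any length."""
    h = h0
    hs = []
    for x in xs:
        h = rnn_step(x, h, Wxh, Whh, bh)
        hs.append(h)
    return hs

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3  # illustrative sizes
Wxh = rng.normal(scale=0.1, size=(hidden_size, input_size))
Whh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
bh = np.zeros(hidden_size)
h0 = np.zeros(hidden_size)

# The same weights handle a 5-step and a 50-step sequence.
short_seq = [rng.normal(size=input_size) for _ in range(5)]
long_seq = [rng.normal(size=input_size) for _ in range(50)]
hs_short = rnn_forward(short_seq, h0, Wxh, Whh, bh)
hs_long = rnn_forward(long_seq, h0, Wxh, Whh, bh)
```

Nothing in the weights depends on the sequence length, which is why the loop can run for as many steps as there are inputs.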

A Simple Example of a Character-Level Language Model :

  1. First we apply one-hot encoding to each letter.
  2. Then we feed the encoded letters through the RNN.
  3. For the output, we want the score of the correct next character (the ‘2.2’ in the figure) to be high, but initially it is not. The RNN then back-propagates to adjust the weights so that when the input is ‘h’, the output gives ‘e’ the maximum score.
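
The three steps above can be sketched roughly as follows, assuming a four-letter vocabulary and the training word "hello" (both are assumptions for the example). The weight names and hidden size are hypothetical, and only the forward pass and the loss are shown; back-propagation would then adjust the weights to lower this loss.

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']  # assumed vocabulary for the example
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    """Step 1: encode a character as a one-hot vector."""
    v = np.zeros(len(vocab))
    v[char_to_ix[ch]] = 1.0
    return v

def softmax(scores):
    e = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
hidden_size = 3  # illustrative size
Wxh = rng.normal(scale=0.1, size=(hidden_size, len(vocab)))
Whh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
Why = rng.normal(scale=0.1, size=(len(vocab), hidden_size))

inputs, targets = "hell", "ello"  # each target is the next character
h = np.zeros(hidden_size)
loss = 0.0
for x_ch, y_ch in zip(inputs, targets):
    h = np.tanh(Wxh @ one_hot(x_ch) + Whh @ h)  # step 2: RNN update
    probs = softmax(Why @ h)                    # scores -> probabilities
    loss += -np.log(probs[char_to_ix[y_ch]])    # step 3: penalize a low score for the correct char
```

With untrained weights the loss is high because the correct next character does not yet get the largest score; training pushes that score up.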

Character-level language model -RNN.jpg

Important RNN Parameters :

  1. Seq_length : An RNN cannot back-propagate through all the data of a book or long text at once, as that would be very resource intensive. With seq_length we set how much of the history we consider at a time, i.e. with seq_length = 25 we train on chunks of 25 characters at a time.
  2. Regularization : With RNNs, regularization sometimes yields worse results than not using any, so it can be tricky to use.
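
To make the seq_length idea concrete, here is a small sketch of how a text could be cut into training chunks. The helper name make_chunks and the sample sentence are hypothetical; the notes do not specify how the chunks are built.

```python
def make_chunks(text, seq_length):
    """Yield (inputs, targets) pairs; targets are the inputs shifted by one character."""
    for start in range(0, len(text) - 1, seq_length):
        inputs = text[start:start + seq_length]
        targets = text[start + 1:start + seq_length + 1]
        # The last chunk may be shorter; trim inputs so both have the same length.
        yield inputs[:len(targets)], targets

text = "the quick brown fox jumps over the lazy dog"
chunks = list(make_chunks(text, seq_length=25))
```

Each chunk is trained on separately, so memory use depends on seq_length rather than on the length of the whole text.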

Problem : Can an RNN learn a dependency longer than 25 characters if seq_length is capped at 25?
Answer : The RNN trains on sequences of 25, but at test time it can generalize what it learned to longer spans. For example, in the text every opening quote has a closing quote. The RNN can learn this from sequences of length 25, yet at test time it can apply the same rule (opening quote, then closing quote) to a span of more than 100 characters.

Question : How do we generate captions for images using an RNN?
Answer : Unlike earlier architectures, where we feed the image in a single time, we can let the RNN look back at the image and reference parts of it while describing it. We can do this in a fully trainable way, so that the RNN not only generates the words but also learns where to look next.
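
One common way to make "where to look" trainable is soft attention, where the hidden state scores each image region and the RNN receives a weighted average of the region features. The sketch below assumes that approach; the scoring form, the function name attend and all shapes are assumptions, not details from these notes.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def attend(features, h, W_att):
    """features: (num_regions, feat_dim), h: (hidden_size,), W_att: (feat_dim, hidden_size)."""
    scores = features @ (W_att @ h)  # one relevance score per image region
    weights = softmax(scores)        # where to look; non-negative and sums to 1
    context = weights @ features     # weighted average of the region features
    return context, weights

rng = np.random.default_rng(0)
num_regions, feat_dim, hidden_size = 6, 5, 3  # illustrative sizes
features = rng.normal(size=(num_regions, feat_dim))
h = rng.normal(size=hidden_size)
W_att = rng.normal(scale=0.1, size=(feat_dim, hidden_size))
context, weights = attend(features, h, W_att)
```

Because every operation here is differentiable, W_att can be trained by back-propagation together with the rest of the network.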

Question : What is the difference between RNN and LSTM? Why do we need LSTM when we have RNN?
Answer : 
LSTM is the standard now. An LSTM is still a recurrent network like the plain RNN; the only difference is that the recurrence formula is more complex. We are still taking the hidden vector from the layer below and the hidden vector from the previous time step, concatenating them and multiplying by a weight matrix, but now we have a more complex formula to update the hidden state. The difference in the formula is highlighted in red in the picture below. All the power of the LSTM is in this more complex formula.

RNN vs LSTM.jpg

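Furthermore, in a plain RNN the gradient has to pass through the same weight matrix and tanh nonlinearity at every time step as it flows backwards, so it tends to vanish or explode. In the LSTM, the cell state is updated additively, which gives the gradient a superhighway to flow backwards through time; the forget gates can still kill the gradient where that is desired.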

LSTM  vs RNN.jpg

Problem : Why do we need to use the more complex formula?
Answer : If x = input and h = previous hidden state, with n the size of the hidden state, then in the LSTM we produce 4n numbers, split into four n-dimensional vectors i, f, o and g. i, f and o go through the sigmoid activation function, while g goes through tanh. Think of i, f and o as binary for now; we make them sigmoid because we want them to be differentiable so that we can back-propagate.
The cell state c is updated as c = f * c_prev + i * g. f is the forget gate, which determines whether the old cell values are remembered or not. When f is zero, i.e. the forget gate is enabled, we discard the old values; we then add i multiplied by g, where g comes from tanh and so lies between -1 and 1. The hidden state is computed as h = o * tanh(c): the gate o determines how much of the cell state is exposed as the hidden state.
where,
f = Whether or not to forget the previous cell state values.
g = How much do we want to add to the cell state?
i = Do we want to add to the cell state?
i*g = The gated update; multiplying the two gives a richer function than g alone.
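
The gates described above can be put together in a minimal NumPy sketch of one LSTM step. The single weight matrix W acting on the concatenated input and hidden state, the ordering of the four slices, and all sizes are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4n, input_size + n), b has shape (4n,)."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b  # 4n pre-activations
    i = sigmoid(z[0:n])          # input gate: do we add to the cell state?
    f = sigmoid(z[n:2 * n])      # forget gate: how much old cell state to keep
    o = sigmoid(z[2 * n:3 * n])  # output gate: how much cell state to expose
    g = np.tanh(z[3 * n:4 * n])  # candidate values in (-1, 1)
    c = f * c_prev + i * g       # additive cell state update
    h = o * np.tanh(c)           # hidden state exposed to the next layer or step
    return h, c

rng = np.random.default_rng(0)
input_size, n = 4, 3  # illustrative sizes
W = rng.normal(scale=0.1, size=(4 * n, input_size + n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
for _ in range(10):
    h, c = lstm_step(rng.normal(size=input_size), h, c, W, b)
```

Note that the cell state c is only ever scaled by f and added to, which is the additive path that lets gradients flow back through many time steps.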


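Question : Can the LSTM idea be applied to ConvNets to improve ConvNet performance?
Answer : Yes, and it is essentially what ResNets do: their skip connections add the input of a block to its output, much like the additive cell state update in the LSTM. ResNet is to PlainNet what LSTM is to RNN.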
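
A heavily simplified sketch of the analogy is shown below, with a single fully connected layer standing in for the convolutional layers of a real ResNet block. The function names and the zero-weight test case are assumptions for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def plain_block(x, W):
    """Plain layer: the input only reaches the output through the weights."""
    return relu(W @ x)

def residual_block(x, W):
    """Residual layer: the input is also added directly to the output."""
    return x + relu(W @ x)

x = np.array([1.0, -2.0, 3.0])
W = np.zeros((3, 3))  # extreme case: the layer itself contributes nothing

plain_out = plain_block(x, W)        # the signal is lost
residual_out = residual_block(x, W)  # the signal passes through unchanged
```

The additive path plays the same role as the cell state in the LSTM: information and gradients have a direct route that does not depend on the learned weights.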

LSTM vs ResNets.jpg

Problem : Why should my forget gate biases be positive initially?
Answer : In an LSTM it is common to initialize the bias of the forget gates to be positive, because initially we want forgetting to be turned off (f close to 1) so that gradients can flow through. Then during training, the network learns when to turn forgetting on.
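
A small numeric sketch of why this matters: the gradient flowing back along the cell state is scaled by the forget gate at every step. The example assumes the gate's pre-activation at initialization is roughly just its bias, and the bias value 2.0 and the 50 steps are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

steps = 50  # number of time steps the gradient travels back (illustrative)

f_zero_bias = sigmoid(0.0)  # 0.5: half of the signal is dropped at each step
f_pos_bias = sigmoid(2.0)   # about 0.88: most of the signal is kept

# Fraction of the cell-state gradient surviving after `steps` steps.
surviving_zero = f_zero_bias ** steps
surviving_pos = f_pos_bias ** steps
```

With a zero bias the surviving fraction shrinks far faster than with a positive bias, so a positive bias gives the network a usable gradient signal early in training.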

Problem : My gradients are exploding or vanishing. What do I need to do?
Answer : Exploding gradients can be controlled with gradient clipping. Vanishing gradients can be mitigated with additive interactions, as in the LSTM.
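
Gradient clipping comes in a few variants; the sketch below shows one of them, rescaling all gradients when their combined norm exceeds a threshold. The function name, the threshold of 5.0 and the example gradients are assumptions for illustration, and element-wise clipping to a fixed range is another option.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]  # same direction, smaller magnitude
    return grads, total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # combined norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
```

Clipping only limits the size of the update, so it addresses exploding gradients but does nothing for vanishing ones.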