Some Practical Examples
- Image captioning : an image comes in and we generate a sequence of words.
- Sentiment classification : consume a number of words and then classify the sentiment of the sentence.
- One to Many : e.g. take in an image and generate a sequence of words (a caption) describing it.
- Many to One : e.g. take in a number of words and classify the sentence as positive or negative sentiment.
- Many to Many : e.g. take in a number of words in English and convert them to a number of words in French.
- Many to Many : e.g. take in a number of frames of video and decide whether the video is child safe or not.
e.g. in the house-number transcription work at Google, rather than taking the single image and generating the house number in one shot, they took a sequential approach: feeding patches of the image from left to right and then producing the house number.
Working : For an RNN, we apply the same function with the same set of parameters at every time step, and this is what allows the RNN to handle sequences of any length.
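The point above can be sketched in NumPy. This is a minimal illustration, not the lecture's actual code: the weight names (Wxh, Whh, Why) and sizes are assumptions chosen for the example.

```python
import numpy as np

hidden_size, input_size, output_size = 8, 4, 4
rng = np.random.default_rng(0)
Wxh = rng.standard_normal((hidden_size, input_size)) * 0.01
Whh = rng.standard_normal((hidden_size, hidden_size)) * 0.01
Why = rng.standard_normal((output_size, hidden_size)) * 0.01
bh = np.zeros(hidden_size)
by = np.zeros(output_size)

def rnn_step(x, h_prev):
    # The SAME weights are reused at every time step, which is
    # why the network can consume a sequence of any length.
    h = np.tanh(Wxh @ x + Whh @ h_prev + bh)
    y = Why @ h + by   # unnormalized scores for the next output
    return h, y

h = np.zeros(hidden_size)
for t in range(10):                         # any sequence length works
    x = np.zeros(input_size)
    x[t % input_size] = 1.0                 # dummy one-hot input
    h, y = rnn_step(x, h)
```

Note that nothing in `rnn_step` depends on the sequence length; only the loop does.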
The RNN does not know anything in the beginning, but as training goes on it learns that an opening quote should have a closing quote, learns about spaces, and so on with more and more iterations.
A Simple Example Of Character-level language model :
- First we apply one-hot encoding to each letter.
- Then we feed the sequence of encoded characters through the RNN.
- For our output we want the score for the correct next character (the '2.2') to be the highest, but it isn't, so the RNN back-propagates to adjust the weights such that when the input is 'h' it generates an output with 'e' as the maximum score.
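The setup above can be sketched for the classic "hello" example. This is an illustrative sketch; the variable names are assumptions, but it shows how one-hot inputs are paired with next-character targets.

```python
# One-hot encode the characters of "hello" and pair each input
# character with the NEXT character as its training target.
chars = sorted(set("hello"))          # ['e', 'h', 'l', 'o']
ix = {c: i for i, c in enumerate(chars)}

def one_hot(c):
    v = [0.0] * len(chars)
    v[ix[c]] = 1.0
    return v

text = "hello"
inputs  = [one_hot(c) for c in text[:-1]]   # 'h', 'e', 'l', 'l'
targets = [ix[c] for c in text[1:]]         # indices of 'e', 'l', 'l', 'o'
# When the input is 'h', training pushes the score for 'e' to be highest.
```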
Important RNN Parameters :
- seq_length : An RNN cannot remember through all the data of a book or long text, as that would be very resource-intensive. With seq_length we adjust how much of the historical data we consider, i.e. with seq_length = 25 we back-propagate through 25 characters at a time during training.
- Regularization : With RNNs, regularization sometimes yields worse results than not using it at all, so it can be tricky to apply.
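The truncation idea can be sketched as simple chunking. This is an assumption about the preprocessing, not code from the lecture: the text is split into seq_length-sized chunks so back-propagation only ever spans 25 steps.

```python
# Split training text into chunks of seq_length characters; gradients
# never flow across chunk boundaries, though the hidden state can be
# carried over from one chunk to the next.
seq_length = 25
text = "some long training text " * 20

chunks = [text[i:i + seq_length] for i in range(0, len(text), seq_length)]
```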
Problem : Can an RNN learn a sequence longer than 25 if seq_length is capped at 25?
Answer : The RNN learns from sequences of length 25, but at test time it can generalize that learning to longer sequences. For example, from 25-character sequences it can learn that every opening quote has a closing quote; at test time it can apply the same rule, matching opening and closing quotes across a text of more than 100 characters, even though it only ever trained on 25.
Question : How do we generate captions for the images using the RNN ?
Answer : Unlike earlier architectures, where we feed in the image a single time, we can let the RNN look back at the image and reference parts of it while it is describing. We can do this in a fully trainable way, such that the RNN not only generates the words but also learns where to look next.
Question : What is the difference between RNN and LSTM? Why do we need LSTM when we have RNN?
Answer : LSTM is the standard now. An LSTM is exactly the same as an RNN; the only difference is that the recurrence formula is more complex. We are still taking the hidden vector from the layer below and from the previous time step, concatenating them, and multiplying by a weight matrix, but now there is a more complex formula for updating the hidden state. All the power of the LSTM is in that more complex formula.
Furthermore, the LSTM's additive cell state acts as a super-highway along which gradients flow backwards, while the forget gates can kill those gradients, but only when that is desired.
Problem : Why do we need to use the more complex formula ?
Answer : If x = input and h = previous hidden state, then in an LSTM we concatenate them and produce 4n numbers (for a hidden size of n). i, f, and o go through sigmoid activation functions while g goes through tanh. Think of i, f, and o as binary gates for now; we actually make them sigmoids because we want them differentiable so that we can back-propagate.
What c is doing : f is the forget gate, which determines whether we should keep the old cell values or not. When f is zero, i.e. the forget gate fires, we discard the old values; we then add the input gate i multiplied by g, a tanh value between -1 and 1. As for the hidden state, it is controlled by the gate o, which determines whether the cell state should be exposed in the hidden state or not.
f = Forget the previous cell values or not?
g = How much do we want to add to the cell state?
i = Do we want to add to the cell state?
i*g = together, a richer function than either alone.
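The gate computation above can be sketched in NumPy. This is a minimal sketch under stated assumptions: the sizes are illustrative, and a single weight matrix W is applied to the concatenation [x; h_prev], whose 4n outputs are split into the i, f, o, g gates described above.

```python
import numpy as np

n, input_size = 8, 4                       # illustrative sizes
rng = np.random.default_rng(0)
W = rng.standard_normal((4 * n, input_size + n)) * 0.1
b = np.zeros(4 * n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev):
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0*n:1*n])   # do we write to the cell?
    f = sigmoid(z[1*n:2*n])   # do we keep the old cell contents?
    o = sigmoid(z[2*n:3*n])   # do we expose the cell in the hidden state?
    g = np.tanh(z[3*n:4*n])   # candidate values in [-1, 1]
    c = f * c_prev + i * g    # additive cell update
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(n), np.zeros(n)
x = np.zeros(input_size)
x[0] = 1.0
h, c = lstm_step(x, h, c)
```

The line `c = f * c_prev + i * g` is the additive interaction that distinguishes the LSTM from the vanilla RNN's purely multiplicative update.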
Question : Can LSTM be applied to ConvNets to improve ConvNet performance ?
Answer : Yes we can, and this is effectively what ResNets do. ResNet is to a plain net what LSTM is to an RNN.
Problem : Why should my forget-gate bias be positive initially?
Answer : In an LSTM we initialize the bias of the forget gates to be positive, because initially we want the forget gates turned off (i.e. not forgetting) so that gradients can flow through. Then during training the network will learn when to turn the forget gates on.
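This initialization trick can be sketched as follows. The layout [i, f, o, g] of the 4n gate pre-activations and the bias value 1.0 are illustrative assumptions; the idea is that a positive bias puts the forget gate's sigmoid near 1 at the start of training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 8
b = np.zeros(4 * n)          # biases for the 4n gate pre-activations
b[1*n:2*n] = 1.0             # forget-gate slice gets a positive bias

# With zero pre-activations elsewhere, the forget gate starts at
# sigmoid(1.0) ~ 0.73, i.e. mostly "keep the cell state", so
# gradients flow through the cell at the start of training.
f_init = sigmoid(b[1*n:2*n])
```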
Problem : My gradients are exploding or vanishing. What do I need to do?
Answer : Exploding gradients can be controlled with gradient clipping. Vanishing gradients can be addressed with additive interactions, as in the LSTM.
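Gradient clipping can be sketched as rescaling by the global norm. The threshold of 5.0 and the function name are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    # Compute the global L2 norm over all gradient arrays; if it
    # exceeds max_norm, rescale every gradient by the same factor.
    total_norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

exploding = [np.full(10, 100.0)]       # huge gradient, norm ~ 316
clipped = clip_gradients(exploding)
```

Clipping only changes the magnitude of the update, not its direction, which is why it tames explosions without derailing training.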