Question : What is a Skip-gram model?
Answer : A skip-gram model is a way of creating dense word vectors using a neural network. The aim of the neural network, in this case, is to predict the contextual or neighboring words from a given word.
Question : Why do we use it?
Answer: We use the skip-gram model over SVD-based dense vectorisation mainly because it is
- Scalable and easy to extend with new words: it does not require retraining across the whole corpus
Question: When is Skip-Gram model most useful?
Answer : Since the skip-gram model treats each context-target pair as a new observation, it works best on larger data-sets.
Question: How does the skip-gram model train itself?
Answer : The skip-gram model computes the probability of a context word appearing with the center word: it measures their similarity with the dot product, then converts that similarity into a probability by passing it through the soft-max function.
We have the center word represented by the one-hot vector W(t). Then we have W, i.e. the matrix of center word representations. When we multiply the center word's one-hot vector by W, we get the center word's vector representation V(c).
Then we have the second matrix, U, i.e. the representation of the context words. When we multiply the center word vector V(c) by the context word matrix, we get the model's output scores, one per vocabulary word, i.e. U(o) Transpose * V(c) for each outside word o.
Now we convert that model output to probabilities using the soft-max function. Then we compare the soft-max output with the true word's one-hot vector. If our predicted soft-max output is far off, we run numeric optimization so that better matrices are learnt.
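The forward pass described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the four-word vocabulary, the 3-dimensional vectors, and every number in W and U are made up, and the helper names (one_hot, softmax, forward) are hypothetical.

```python
import math

# Toy vocabulary; all values below are illustrative assumptions.
vocab = ["the", "quick", "brown", "fox"]
dim = 3

# W: center word matrix, U: outside (context) word matrix, one row per word.
W = [[0.1, 0.2, 0.3],
     [0.4, 0.1, 0.0],
     [0.2, 0.3, 0.1],
     [0.0, 0.1, 0.4]]
U = [[0.3, 0.1, 0.2],
     [0.1, 0.4, 0.1],
     [0.2, 0.2, 0.3],
     [0.4, 0.0, 0.1]]

def one_hot(index, size):
    """Return a one-hot vector with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def softmax(scores):
    """Turn raw scores into probabilities (max is subtracted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(center_index):
    x = one_hot(center_index, len(vocab))
    # Multiplying the one-hot vector by W simply selects one row of W: V(c).
    v_c = [sum(x[i] * W[i][d] for i in range(len(vocab))) for d in range(dim)]
    # One dot product U(o) Transpose * V(c) per vocabulary word.
    scores = [sum(U[o][d] * v_c[d] for d in range(dim)) for o in range(len(vocab))]
    return v_c, softmax(scores)

v_c, probs = forward(vocab.index("quick"))
# Cross-entropy loss against the true outside word "the".
loss = -math.log(probs[vocab.index("the")])
```

Training would then adjust W and U to lower this loss for every observed (center, outside) pair.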
Question: Why are we representing each word twice, once in the center word matrix and then again in the outside word matrix?
Answer: We do this because using separate context words adds a lot of flexibility to our model. In the case of complex languages such as French, English with compound nouns, or Japanese, we will need to do character-based contextual evaluation; in such cases our context word will be vastly different from the target word. For example, if we want to use a two-gram word for contextual evaluation, then we must use two different word vectors.
Question: Can you explain skip-gram to me with a simple example?
Answer : Creating a skip-gram model is an easy three step process.
1. Create a data-set of (context, word) pairs, i.e. words and the context in which they appear, e.g.
“The quick brown fox”: if this is the document, then the data-set could be
( [ the, brown ], quick ) , ( [ quick, fox ], brown ) …
i.e. given the word “quick”, the skip-gram model tries to predict the context in which it frequently appears, i.e. between the words “the” and “brown”.
In skip-gram models we maximize the probability of the outside words given the center word.
Here’s the visualization of skip-gram model to understand it better
Question: How is the Context model represented across the training model?
Answer : The (context, target) pairs are modeled as (input, output) pairs, as in
( [ the, brown ], quick ) ===> (quick, the) , (quick, brown)
Question : What window size for skip-grams should we use ?
Answer : We can use any window size from 1 to n, where a window size of 1 takes in the context words 1 position away from the center word, while a window size of 2 takes in the context words up to 2 positions away.
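To make the pair construction and the window size concrete, here is a minimal sketch that turns a tokenized sentence into (input, output) pairs. The function name skipgram_pairs and its window parameter are hypothetical names chosen for this illustration.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs for every token and its neighbors."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # left edge of the window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                          # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
pairs_w1 = skipgram_pairs(tokens, window=1)
pairs_w2 = skipgram_pairs(tokens, window=2)
```

With window=1 the word "quick" contributes (quick, the) and (quick, brown), matching the pairs above; with window=2 it also contributes (quick, fox).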
Question : The insight that the word “quick” appears in the context of “the” and “brown” is not very relevant, especially for a commonly used word like “the”. What can we do about it?
Answer : The best option is to run TF-IDF to remove commonly occurring words such as the, a, an, etc.
Question: Now let’s say we have initialized the words with random vectors, and our soft-max comparison finds that the prediction is way off. How do we back-propagate the error and determine how we should change the vectors?
Answer: We determine how much to change each vector by calculating the gradient of the loss, as shown in the figure below.
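For reference, the gradient with respect to the center word vector can be written out as below. This is a sketch under assumed notation: v_c is the center word vector, u_w is the outside vector of word w, o is the true outside word, V is the vocabulary size, and the loss J is the negative log of the soft-max probability.

```latex
J = -\log p(o \mid c)
  = -u_o^\top v_c + \log \sum_{w=1}^{V} \exp(u_w^\top v_c)

\frac{\partial J}{\partial v_c}
  = -u_o + \sum_{w=1}^{V} p(w \mid c)\, u_w

v_c \leftarrow v_c - \alpha \, \frac{\partial J}{\partial v_c}
```

In words, the center vector is pulled toward the observed outside vector and pushed away from the outside vector the model currently expects on average.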
Question: We are using the dot product as the similarity measure behind the probability. Why are we doing so, and what about using cosine similarity?
Answer: Well, it’s the simplest math. For vector similarity, the common measure is cosine similarity, and the dot product can be cheated by inflating a vector's length. However, we are scoring each word against every other word's vector. If we cheat by making one word's vector large, it also raises that word's scores against the other words, since they are evaluated with the same lengthened vector. Hence in this particular case the cheat has little effect, and on top of that the dot product is simple to compute.
Question: In all cases of machine learning, we try to reduce the loss function, i.e. given a loss curve or error curve we want to find the lowest error point. How is it done? Can you show it with example code?
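A minimal sketch of gradient descent on a one-dimensional loss curve follows. The loss (x - 3)^2, the starting point, the step size, and the function names are all assumptions picked for illustration; they are not taken from any skip-gram implementation.

```python
def loss(x):
    """A simple U-shaped error curve with its lowest point at x = 3."""
    return (x - 3.0) ** 2

def grad(x):
    """Derivative of the loss: it points uphill, so we step against it."""
    return 2.0 * (x - 3.0)

def gradient_descent(x0, alpha=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - alpha * grad(x)   # move a small step downhill
    return x

x_min = gradient_descent(x0=0.0)
```

Each iteration moves x a little in the direction that lowers the loss, so x_min ends up very close to 3, the lowest error point of this curve.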
Question: In gradient descent, what does alpha mean and why should it be small?
Answer: As seen in the above figure as well, alpha is the size of each step we take down the hill. If alpha is too large, we may overshoot the minimum and keep flipping to and fro on either side of the lowest error point.
alpha = step_size (gradient descent step size)
Question: Why is gradient descent always done in small steps?
Answer : Gradient descent is done in small steps because if we do a very large step update, then in a U-shaped valley we might overshoot the minimum and jump back and forth between the two sides of the valley.
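The overshooting behaviour can be seen numerically on the U-shaped curve f(x) = x^2, whose gradient is 2x. The curve, the three alpha values, and the helper name descend are assumptions made for this sketch.

```python
def descend(alpha, x0=1.0, steps=5):
    """Record the position after each gradient step on f(x) = x**2."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - alpha * 2.0 * xs[-1])
    return xs

small = descend(0.1)      # shrinks steadily toward the minimum at 0
large = descend(1.0)      # jumps between +1 and -1 and never settles
too_large = descend(1.1)  # every jump lands farther from the minimum
```

With alpha = 0.1 each step multiplies x by 0.8, so it slides down one side of the valley. With alpha = 1.0 the factor is -1, so x flips sides forever, and with alpha = 1.1 the factor is -1.2, so the error grows.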
Question: We have 40 billion tokens of text, so to make a single step up or down, should we run through all the 40 billion tokens?
Answer: If we did, it would be extremely expensive. Instead, we take a small subset of the training examples and compute each update from only that subset. This makes training much faster than running over all 40 billion tokens for every step, and it is known as Stochastic Gradient Descent (SGD).
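As a rough sketch of the idea, the loop below fits a single weight w using a small random batch per step instead of the full data-set. The synthetic data (y = 2x), the batch size, the step size, and the function names are assumptions for illustration only.

```python
import random

random.seed(0)

# Synthetic data-set for y = 2x; a stand-in for a much larger corpus.
data = [(i / 1000.0, 2.0 * i / 1000.0) for i in range(1000)]

def batch_gradient(w, batch):
    """Gradient of the mean squared error (w*x - y)**2 over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def sgd(data, alpha=0.5, batch_size=32, steps=200):
    w = 0.0
    for _ in range(steps):
        batch = random.sample(data, batch_size)  # small random subset
        w -= alpha * batch_gradient(w, batch)    # update from the subset only
    return w

w = sgd(data)
```

Each step looks at only 32 of the 1000 examples, yet w still approaches the true value of 2, which is the trade that makes training on billions of tokens practical.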
Question : When we compute the probability of the outside word given a center word, the upper part is simple, i.e. multiply a 100-dimensional vector with a 100-dimensional vector, but the gigantic sum over the whole vocabulary, i.e. millions of words (the sum over w = 1 to V), is infeasible. So what should we do?
Answer : The core idea of word2vec is to maximize the similarity between words which appear close together and minimize the similarity between words that do not appear together.
The first part, i.e. maximizing the similarity between words which appear close together, is given by the numerator (built from the dot product of u_o and v_c) and is computable, since the number of words that appear close together is limited.
However, minimizing the similarity between words that do not appear together is given by the denominator (i.e. the sum over every word w in the vocabulary of the dot product of u_w and v_c). There are millions of such unrelated words, and computing the dot product for all such pairs is computationally infeasible. Hence we use negative sampling, so that the problem is computationally feasible.
$$p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}$$
REF: Negative sampling
Hence the most commonly used algorithm is the Skip-gram model with negative sampling.
Skip-Gram with negative sampling
Question: Why do we do negative sampling in word2vec?
Answer: With negative sampling in word2vec, we take all the related words from the related word pairs, while, of the billions of unrelated word pairs, we only take (n_sample_size - n_correct) incorrect word pairs, which is hundreds of unrelated pairs rather than millions. Considering only hundreds of unrelated pairs makes p(o|c), i.e. the probability of an outside word given the center word, computationally feasible.
Question: What is negative sampling?
Answer: The word vector matrix is sparse, and for a word, let's say “NLP”, there are a lot of words not related to it, for example zebra, lion, tiger, etc., and hundreds of thousands of other unrelated words. These unrelated words tell us very little when we are learning the word vector for “NLP”, yet including all of them makes every update computationally expensive. Hence we use negative sampling, i.e. we consider all the N(r) relevant words and then k - N(r) other words, where k = some constant number of negative samples, i.e. a few noise words (the center word paired with a random word).
Question : We have seen sigmoid function often being used ? Why do we use it?
Answer: As we already know, there are some words whose frequency is very high. The sigmoid function squashes any score into the range 0 to 1, as can be seen in the sigmoid graph, so a very large score (for example from a very frequent word) saturates near 1 instead of growing exponentially.
Question: How do we do skip-gram negative sampling?
Answer : Let’s consider an example text as below. With L = 2, and taking the word “apricot” as the input word, we will then have 4 context words. The aim is to maximize the similarity with each context word, computed as the dot product.
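To make the skip-gram algorithm more efficient, we use the sigmoid function of the word-context pair's dot product, i.e. $\sigma(u_o^\top v_c)$ for the input word vector $v_c$ and the context word vector $u_o$,
where the sigmoid is $\sigma(x) = \frac{1}{1 + e^{-x}}$.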
The reason we use the sigmoid is to clip the contribution of commonly occurring words to a certain maximum. As for the choice of negative samples, we choose k random noise words, i.e. if k = 2, then for each target-context pair we choose 2 negative target-noise word pairs. In the above case of four context words, we will therefore consider 8 noise words.
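The counting above (4 context words, k = 2, hence 8 noise words) and the sigmoid-based objective can be sketched as follows. Everything concrete here is an assumption for illustration: the placeholder context words c1..c4, the noise vocabulary n0..n19, the random vectors, and the helper names. Noise words are drawn uniformly for simplicity, whereas word2vec draws them from a smoothed unigram distribution, and a single dictionary of vectors stands in for the separate center and context matrices.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pair_loss(target_vec, context_vec, noise_vecs):
    """Low when the true pair is similar and the noise pairs are dissimilar."""
    loss = -math.log(sigmoid(dot(target_vec, context_vec)))
    for noise_vec in noise_vecs:
        loss -= math.log(sigmoid(-dot(target_vec, noise_vec)))
    return loss

rng = random.Random(0)
dim, k = 4, 2
context_words = ["c1", "c2", "c3", "c4"]        # placeholders for the 4 context words
noise_vocab = ["n%d" % i for i in range(20)]    # hypothetical unrelated words
words = ["apricot"] + context_words + noise_vocab
vectors = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in words}

total_loss = 0.0
noise_count = 0
for c in context_words:
    noise = [rng.choice(noise_vocab) for _ in range(k)]  # k noise words per pair
    noise_count += len(noise)
    total_loss += pair_loss(vectors["apricot"], vectors[c],
                            [vectors[n] for n in noise])
```

Training would lower total_loss by raising the dot product for the four true pairs and lowering it for the eight noise pairs, so only 12 dot products are needed per input word instead of one per vocabulary entry.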
In our model, we want to have low similarity of the input word with these negative words i.e low