Continuous Bag of Words (CBOW) model

Question: Why use CBOW when we have skip-gram models?
Answer: CBOW is richer and performs better than the skip-gram model.

Question: Why does CBOW perform better than the skip-gram model?
Answer: Because CBOW makes use of more data in its input than the skip-gram model does, it is better able to capture the word vectors efficiently.
For example, in a skip-gram model there is one input word and one output word.
Take the sentence:
“The dog barked at the mailman. The mailman dropped the mail and went away.”
skip-gram model = (e.g. input: 'dog', output: 'barked')
while with CBOW = input: ['the', 'barked', 'at'], output: 'dog'
From a human’s perspective we can clearly see that “the … barked at” carries more information about the word “dog” than the simple individual associations of “the”, “barked” and “at” with “dog”. And the more information there is, the more the machine is able to learn.
The skip-gram model vs the CBOW model can be likened to “1 + 1” vs “5”:
skip-gram only knows that “dog” occurs near “barked”, “the” or “at” individually. CBOW, however, is able to infer that “dog” occurs when “the”, “barked” and “at” are collectively present, i.e. CBOW is able to learn more contextual information.
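These pairings can be sketched in a few lines of Python (a toy illustration, not the word2vec training code itself; the symmetric window size of 2 is an assumption chosen to reproduce the example above):

```python
# Toy illustration: build skip-gram and CBOW training pairs from the
# example sentence with a symmetric context window of 2.
sentence = "the dog barked at the mailman".split()
window = 2

skipgram_pairs = []   # (input_word, output_word)
cbow_pairs = []       # (context_words, target_word)
for i, target in enumerate(sentence):
    lo = max(0, i - window)
    hi = min(len(sentence), i + window + 1)
    context = [sentence[j] for j in range(lo, hi) if j != i]
    cbow_pairs.append((context, target))
    for c in context:
        skipgram_pairs.append((target, c))

print(cbow_pairs[1])       # (['the', 'barked', 'at'], 'dog')
print(skipgram_pairs[:2])  # [('the', 'dog'), ('the', 'barked')]
```

Note how CBOW sees the whole context as one training example, while skip-gram splits it into several one-to-one pairs.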

Question: Excellent! Does that mean that the larger the window size we use for CBOW, the better it gets, since each extra word in the window adds more information?
Answer: Unfortunately, this is not the case: increasing the word window does not increase the robustness of the model. While we may think each new word adds more information, it may in fact add more noise. For example, in the above case,
with window size 3 :: input: ['the', 'barked', 'at'], output: 'dog'
with window size 10 :: input: ['the', 'barked', 'at', 'the', 'mailman', 'dropped', 'mail', 'and', 'went', 'away'], output: 'dog'

The later words are more noise than information about “dog”; words such as “dropped” and “mail” have no relevance to it.

Question: What if we use the full stop to limit our window size? Will it work?
Answer: It might work and produce better results. However, sentence boundaries are marked differently across languages, so relying on the English full stop would transform the deep learning model from one generically applicable across all languages into one specific to English. This would need to be explored further.

Question: Why do we call this model a bag of words? Is there any particular reason behind it?
Answer: Yes, there is. As the phrase suggests, in this architecture the order of the words does not matter: we treat all the contextual words as a bag of words, hence the name “bag of words”.

Question: What does continuous bag of words do?
Answer: Continuous bag of words tries to predict a word from a context of words. In this model a text is represented as a bag of its words, disregarding grammar and even word order, but keeping multiplicity.

Question: Where is it commonly used?
Answer: It is widely used for document classification, where the frequency of occurrence of each word is used as a feature. E.g. if a document talks a lot about boats, then we know that the document is about boats.
It can also be used on images for image classification, i.e. if an image contains a lot of cats then the image can be categorised as a candidate for a cat search.

Question: Can you explain CBOW with an example?
Answer: Let’s say we have three documents:
1. “John likes movies. Mary likes movies too.”
2. “John likes football too.”
3. “Bikal likes Machine Learning too.”

Based on these three documents, a vocabulary list is created:
[John, likes, movies, Mary, too, football, Machine, Learning, Bikal]

Now, with the bag-of-words representation, the documents are represented as word counts over
[John, likes, movies, Mary, too, football, Machine, Learning, Bikal]:
Document 1 : [1, 2, 2, 1, 1, 0, 0, 0, 0]
Document 2 : [1, 1, 0, 0, 1, 1, 0, 0, 0]
Document 3 : [1, 1, 0, 0, 1, 0, 1, 1, 1]

Now from this vector composition, we can easily see that documents 1 and 2 are related.
And when we want to search for the document related to Machine Learning, it becomes evident that Document 3 is the one we are looking for.

Question: But I already see a lot of problems even in this early example. One of them: in Document 1 the word “likes” appears twice. Does that mean the document is more relevant to “likes”?
Answer: No, the document is not more relevant to “likes”. Certain words appear too often, like “a”, “an”, “the”. These need to be dealt with, and one of the most popular ways is to weight a term by the inverse of its document frequency, i.e.
tf-idf(t, d) = (frequency of term t in document d) × log(total number of documents / number of documents containing t).
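A hedged sketch of this weighting over the three example documents (tokens are lower-cased here for simplicity, and the unsmoothed idf = log(N / df) form is assumed; real libraries often add smoothing):

```python
import math

# tf-idf over the three example documents from the text.
docs = [
    "john likes movies mary likes movies too",
    "john likes football too",
    "bikal likes machine learning too",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term)                            # term frequency
    df = sum(1 for toks in tokenized if term in toks)      # document frequency
    return tf * math.log(N / df)

# "likes" appears in every document, so its idf is log(3/3) = 0:
print(tf_idf("likes", tokenized[0]))                 # 0.0
# "machine" appears only in document 3, so it gets a high weight there:
print(round(tf_idf("machine", tokenized[2]), 3))
```

This is exactly the desired behaviour: ubiquitous words like “likes” are down-weighted to nothing, while distinctive words like “machine” keep a high score.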

To know more, see: How does Term Frequency – Inverse Document Frequency (tf-idf) work.

Question: How does the word2vec CBOW model work?
Answer: The main idea is that instead of capturing co-occurrence counts, we predict the centre word from its surrounding words, i.e. the n words to the left and the n words to the right (skip-gram does the reverse, predicting the surrounding words from the centre word). While doing so, a nice side effect is that the words end up correctly plotted in the vector space.

[Figure: word2vec – word vectorisation with skip-gram]
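A minimal numpy sketch of the CBOW forward pass (toy vocabulary, random untrained weights; the embedding dimension and layer shapes are assumptions for illustration, not the real word2vec configuration):

```python
import numpy as np

# Toy CBOW forward pass: average the context embeddings, then score
# every vocabulary word for the centre position with a softmax.
rng = np.random.default_rng(0)
vocab = ["the", "dog", "barked", "at", "mailman"]
word_to_id = {w: i for i, w in enumerate(vocab)}

V, D = len(vocab), 8             # vocabulary size, embedding dimension
W_in = rng.normal(size=(V, D))   # input (context) embedding matrix
W_out = rng.normal(size=(D, V))  # output (prediction) weight matrix

def cbow_forward(context_words):
    ids = [word_to_id[w] for w in context_words]
    h = W_in[ids].mean(axis=0)           # average of the context vectors
    scores = h @ W_out                   # one score per vocabulary word
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()

probs = cbow_forward(["the", "barked", "at"])
print(probs.shape, round(float(probs.sum()), 6))   # (5,) 1.0
```

Training would then adjust `W_in` and `W_out` so that the probability of the true centre word (here “dog”) goes up.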

Question: Well, it now captures the semantic relationship. However, we observe that the dimension (matrix size) has increased further. Are there any problems with the increased matrix size?
Answer: Yes. We have now captured phrases as well, but the vocabulary has grown sharply: in the example above the list grows from 9 entries to 12 as we move from 1-grams to 2-grams, and it grows exponentially as we move to higher grams. And this makes the model extremely expensive.

Question: So with higher grams the computational intensity grows exponentially. How can we reduce the computational cost?
Answer: We can use feature hashing: instead of a large, sparse dictionary that grows with the training data, we limit the feature (matrix) size to n features and map the words or n-grams to features using a hash function. The hash function removes the need for an associative array or dictionary, which would have required memory.
This works surprisingly well because of the large word sparsity. Furthermore, it is also able to handle misspellings. This hashing technique was used for spam filtering at Yahoo. For example,
if our feature vector is [cat, dog, cat] with hash(cat) = 1 and hash(dog) = 2,
then across 4 feature dimensions our representation will be something like [0, 2, 1, 0].
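A hedged sketch of this hashing trick (the `toy_hash` function below is a hypothetical stand-in that reproduces the hash values from the example; a real system would use a proper string hash such as MurmurHash modulo the feature count):

```python
# Feature hashing: map tokens to a fixed number of feature slots with a
# hash function instead of a growing dictionary.
N_FEATURES = 4

def toy_hash(token):
    # Deterministic stand-in hash matching the example in the text:
    # hash("cat") -> 1, hash("dog") -> 2. Other tokens fall back to a
    # simple character-sum hash modulo the feature count.
    return {"cat": 1, "dog": 2}.get(token, sum(map(ord, token)) % N_FEATURES)

def hash_features(tokens):
    vec = [0] * N_FEATURES
    for t in tokens:
        vec[toy_hash(t)] += 1   # collisions simply add into the same slot
    return vec

print(hash_features(["cat", "dog", "cat"]))   # [0, 2, 1, 0]
```

The feature vector stays a fixed size no matter how many new tokens (or misspellings) appear, at the cost of occasional hash collisions.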

Question: Nice trick. Can we use it elsewhere?
Answer: This hashing trick is not limited to text classification; it can be applied to any problem that involves a large number of features.

Question: What loss function do we use for stochastic gradient descent (SGD) in the neural network?
Answer: We use noise-contrastive estimation (NCE), e.g. tf.nn.nce_loss() in TensorFlow.
It can be explained as follows. Let’s say we want to predict “the” from “quick”. We then select a certain number of noisy examples at random in addition to the correct example, i.e.
(sheep, quick), (man, quick) – 2 noisy examples.
Now we calculate the loss function for those examples:
[Figure: NCE loss function]
Weight update: with the derivative of J_neg.
Repeat n times.
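As a sketch of the idea, here is the simplified negative-sampling objective that word2vec uses in place of full NCE (random untrained vectors, toy dimension; the word list and scale are assumptions for illustration):

```python
import numpy as np

# Negative-sampling loss: pull the true (input, output) pair together,
# push the sampled noise pairs apart.
rng = np.random.default_rng(0)
D = 8
vecs = {w: rng.normal(scale=0.1, size=D)
        for w in ["quick", "the", "sheep", "man"]}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(inp, positive, negatives):
    # -log sigma(v_pos . v_in) - sum over noise of log sigma(-v_neg . v_in)
    loss = -np.log(sigmoid(vecs[positive] @ vecs[inp]))
    for n in negatives:
        loss -= np.log(sigmoid(-(vecs[n] @ vecs[inp])))
    return float(loss)

# Predicting "the" from "quick", with "sheep" and "man" as the 2 noise words:
print(neg_sampling_loss("quick", "the", ["sheep", "man"]))
```

SGD would then take the derivative of this loss and update only the few vectors involved, rather than the whole vocabulary.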

Question: Can we visualise it?
Answer: Yes, we can visualise the vectors by projecting them down to 2 dimensions with the t-SNE dimensionality-reduction technique.

Question: How is word2vec fast?
Answer: In word2vec we do not score all the words; doing so would be extremely expensive. Instead we take the positive word together with a small number of sampled negative words. This works really well, since there are many words that never occur in each other’s context, e.g. mitochondria and republicans, or zebra and laptop/TV. Hence random sub-sampling of non-co-occurring words is an effective solution.

Question: What is the most common window size?
Answer: You can use any window size, but the most common is 5 – 10.

Question: Within the window, does context position matter?
Answer: No, it doesn’t. While it might seem that word position (left or right) should be taken into account, in fact it is desirable not to. Consider passive vs active voice, e.g.
“Bank has debt problem” vs “Debt problem is possessed by bank.”

Question: What sort of relations do the word vectors capture?
Answer: They are able to capture:
a. Syntactic / grammatical relationships
[Figure: grammatical relations]
b. Semantic relationships, i.e. similarity relations such as car being similar to bus. In the example below, shirt is to clothing as chair is to furniture.
[Figure: semantic relationships]

Question: What is the difference between the old count-based methods and the newer skip-gram models?
[Figure: count-based vs skip-gram models]

Question: We see that while skip-gram has improved performance, it mainly lags in speed. Is it possible to combine both so that we get the benefits of both worlds?
Answer: Yes, it is possible. Instead of the exponential (softmax) loss function we had for word2vec, we use a simpler loss function that is able to capture the same vector relationships. Furthermore, to gain speed, instead of running the model over a fixed window size we train it on the co-occurrence statistics of the whole corpus. With this we capture the benefits of both worlds. This model is known as the GloVe (Global Vectors) model.
[Figure: GloVe model word vectors]
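For reference, the GloVe objective is a weighted least-squares fit to the log co-occurrence counts (the standard form from the GloVe paper; here X_ij is the co-occurrence count of words i and j, and f is a weighting function that caps the influence of very frequent pairs):

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

Because the squared-error term replaces the softmax, no normalisation over the whole vocabulary is needed, which is where the speed gain comes from.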

For more information, see: NLP with Global optimal Word Vectors.

Question: Are NNs immune to skewed class distributions?
Answer: No, they are not. For example, if in sentiment analysis we have 95 positive sentences and 5 negative sentences and we train a NN model on them, it may be difficult for the NN to learn a proper model that finds the exact sentiment.
