Question: What is the fundamental idea of the Word2Vec models?
Answer: The underlying idea of NLP with deep learning, of word vectorisation, and of Word2Vec is that "similar words occur in almost the same environments", e.g.
"oculist" and "eye-doctor" occur in almost the same environments.

"You shall know a word by the company it keeps." – J. R. Firth

Question: How do we capture this fundamental idea in the Word2Vec models?
Answer: The idea is captured by maximizing the similarity between words that appear close together and minimizing the similarity between words that do not.
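As a rough illustration, the "similarity" being maximized is usually a dot product or cosine between word vectors. The sketch below (plain NumPy, with made-up toy vectors of my own) shows the quantity that training pushes up for co-occurring words and down for non-co-occurring ones.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up 4-dimensional vectors, just for illustration.
oculist    = np.array([0.9, 0.1, 0.3, -0.2])
eye_doctor = np.array([0.8, 0.2, 0.4, -0.1])
banana     = np.array([-0.5, 0.7, -0.6, 0.3])

print(cosine(oculist, eye_doctor))  # high  -> training pushes this up
print(cosine(oculist, banana))      # low   -> training pushes this down
```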

Question: Can you explain the Word2Vec fundamentals to me with a more elaborate real example?
Answer: If we have a text such as the one in the picture below, then we humans can easily understand that "tesguino is an alcoholic beverage like beer".

Fig: Word2Vec example text
But for a computer this is not intuitive. We can make it intuitive to the computer by developing an algorithm that captures the principle discussed above:
“Two words are similar if they have similar word contexts.”

Question: How can we capture this word-similarity essence to convert words to vectors?
Answer: We can capture this word-similarity essence using four kinds of vector models:

  • Sparse word vector representations
    • Co-occurrence matrix based sparse vector representation
  • Dense word vector representations
    • SVD (Singular Value Decomposition) based representation
    • Neural network based models, i.e. Skip-gram and CBOW
    • Brown clusters

Question: What algorithms do we use for training word vectors?
Answer: Word vectors are learned from a corpus of text using backpropagation. Initially, each word is assigned a random vector. Then, during training, the word vectors are updated via backpropagation so that words appearing in similar contexts end up close together, i.e. we maximize the similarity metric between words that appear in close proximity and maximize the distance between dissimilar words. The backpropagation updates are made with this principle in mind.
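For a concrete starting point, here is a minimal sketch of training word vectors with the gensim library (assuming gensim 4.x and a tiny tokenized toy corpus of my own); real corpora are of course far larger.

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["tesguino", "is", "an", "alcoholic", "beverage", "like", "beer"],
    ["beer", "is", "brewed", "from", "corn"],
    ["people", "drink", "beer", "at", "the", "festival"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=1,      # keep every word in this toy corpus
    sg=1,             # 1 = Skip-gram, 0 = CBOW
    negative=5,       # negative sampling with 5 noise words
    epochs=100,
)

# Words that appeared in similar contexts should now be nearby in vector space.
print(model.wv.most_similar("beer", topn=3))
```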

Question: Why Word2Vec?
Answer: Learning vector representations of words with the co-occurrence approach is very expensive, even when dimensionality reduction techniques are used. To make the model more scalable, we learn the word vectors directly using backpropagation, and Word2Vec is one of the most popular methods for learning such vector representations in this direct-learning approach.
If you want to know more about the older word-to-vector approach with a co-occurrence matrix, see Word To Vector Conversion with Co-occurrence Approach.
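To see why the co-occurrence approach scales poorly, note that the raw matrix is vocabulary-size by vocabulary-size. A minimal sketch (plain Python, toy corpus and window of my own choosing) of building such a matrix:

```python
from collections import defaultdict

sentences = [
    ["i", "like", "deep", "learning"],
    ["i", "like", "nlp"],
]
window = 1

# counts[(w, c)] = how often c appears within the window around w
counts = defaultdict(int)
for sent in sentences:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[(w, sent[j])] += 1

vocab = sorted({w for s in sentences for w in s})
# The full matrix has |V| x |V| entries -- infeasible to store and factorize
# for a vocabulary of millions of words, which motivates Word2Vec.
print(len(vocab), "x", len(vocab), "matrix;", len(counts), "non-zero entries")
```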

Question: There are millions of words, so is considering all the words during the training process computationally feasible?
Answer: Considering all the words during training is computationally infeasible. There are two ways of making the training tractable:
1. Negative sampling: sample a small number of unrelated (negative) word pairs and consider all the related (positive) word pairs (see the sketch after this list). This works better for frequent words and low-dimensional vectors.
2. Hierarchical softmax: hierarchical softmax is an efficient way of computing the softmax using a Huffman tree. It works better for infrequent words; however, as the number of training epochs increases, it stops being as useful.
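The sketch below shows, in plain NumPy, what a single negative-sampling update looks like for one (center, context) pair: the similarity with the observed context word is pushed up, and the similarity with a few randomly sampled "negative" words is pushed down. The vocabulary size, dimensionality and learning rate are made-up toy values, not those of the reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, lr = 1000, 50, 0.025                   # toy values
W_in  = rng.normal(scale=0.1, size=(vocab_size, dim))   # "center" vectors
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))   # "context" vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(center, context, negatives):
    """One skip-gram negative-sampling SGD step for a (center, context) pair."""
    v = W_in[center].copy()
    grad_v = np.zeros(dim)

    # Positive pair: push the dot product (similarity) up.
    g = sigmoid(np.dot(v, W_out[context])) - 1.0
    grad_v += g * W_out[context]
    W_out[context] -= lr * g * v

    # Negative (randomly sampled) pairs: push the similarity down.
    for n in negatives:
        g_n = sigmoid(np.dot(v, W_out[n]))
        grad_v += g_n * W_out[n]
        W_out[n] -= lr * g_n * v

    W_in[center] -= lr * grad_v

# Example: word 3 observed with context word 17, plus 5 random negatives.
sgns_update(3, 17, rng.integers(0, vocab_size, size=5))
```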

Question: OK, we have dealt with the large-corpus problem, but what about words that occur very frequently, such as articles (a, an, the, etc.) and prepositions (of, in, on, at, etc.)?
Answer: Since such high-frequency words rarely provide any important information, we can delete them completely if we are not too concerned about capturing syntactic relationships (e.g. a noun always comes after an article). However, if we do want to capture syntactic relations, we can instead clip the frequency to some maximum. This sub-sampling increases the training speed.
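One common form of sub-sampling (described in the original word2vec paper) keeps each occurrence of a word with probability roughly sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a small threshold. A minimal sketch, with illustrative frequencies of my own:

```python
import math
import random

def keep_probability(word_freq, t=1e-5):
    """Probability of keeping one occurrence of a word with relative frequency word_freq."""
    return min(1.0, math.sqrt(t / word_freq))

# A very frequent word like "the" is mostly dropped,
# while a rare content word is almost always kept.
print(keep_probability(0.05))   # "the"-like word -> ~0.014
print(keep_probability(1e-6))   # rare word       -> 1.0

def subsample(tokens, freqs, t=1e-5):
    """Randomly drop frequent words before training."""
    return [w for w in tokens if random.random() < keep_probability(freqs[w], t)]
```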

Question: OK, we have dealt with reducing the training size and with the high-frequency words that carry little information. What about the context window size, i.e. what should my window size be during training?
Answer: The size of the context window determines how many words before and after a given word are included as context words during training. According to code.google.com, the recommended value is 10 for the Skip-gram model and 5 for the Continuous Bag of Words (CBOW) model.
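To make the window concrete, the sketch below extracts the context words for each position of a toy sentence; the sentence and the window value are illustrative only.

```python
def context_words(tokens, i, window):
    """Words within `window` positions before and after tokens[i]."""
    left  = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

sentence = ["the", "cat", "sits", "on", "the", "mat"]
for i, w in enumerate(sentence):
    print(w, "->", context_words(sentence, i, window=2))
```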

Question: Can we not train the model on the whole document, since that would be faster, as the same words would not have to be dealt with multiple times in different context windows?
Answer: Yes, whole documents can be used to construct the vectors. This extension was proposed in the 2014 paper "Distributed Representations of Sentences and Documents".
This extension is also known as Paragraph2Vec or Doc2Vec.
For more details, please click here: "Doc2Vec / Paragraph2Vec".

Question: Also, what dimension size should be used during training?
Answer: Generally, the quality of the word embeddings increases with higher dimensionality. However, this increase in quality is not unbounded; after some point the marginal gain diminishes. Typically, the dimensionality of the word vectors is set between 100 and 1000.

Question: What does a word represented as a vector look like?
Answer: A simple, low-dimensional word vector for the word "expect" might look like the following. In practice we use word vectors with several hundred to 1000 dimensions.
expect = [0.28, 0.21, -0.29, 0.68, 0.34, 0.19, -0.9]
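Continuing with the hypothetical gensim model from the earlier training sketch, a learned vector can be inspected directly:

```python
# Assuming `model` is the gensim Word2Vec model trained in the earlier sketch.
vec = model.wv["beer"]   # a NumPy array of length vector_size
print(vec.shape)         # (100,)
print(vec[:8])           # first few components, analogous to the toy vector above
```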

Question: Why is a higher-dimensional vector representation important?
Answer: What has been found is that when we plot these words in the higher-dimensional space, the spatial directions and clusters form some sort of semantic space. For example, all the nouns may be clustered together in one region of the 1000-dimensional space. If we dig deeper, we might also find that animals and humans are clustered in different regions of that noun space.

Question: What is the objective of Word2Vec?
Answer: The objective of Word2Vec is to build a simple, scalable model that is able to learn word representations for billions and billions of words.

Question : How is Word2vec trained?
Answer: There are two algorithms we use to train word2vec.
1. Skip-grams
2. Continuous Bag of Words (CBOW)
And there exist two training methods:
1. Hierarchical softmax
2. Negative Sampling
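In gensim's Word2Vec API (assuming gensim 4.x), the algorithm and the training method are chosen independently via the sg, hs, and negative parameters; a brief sketch of the four combinations, reusing a toy corpus:

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sits", "on", "the", "mat"]]  # toy corpus
common = dict(vector_size=100, window=5, min_count=1)

# Skip-gram + negative sampling
Word2Vec(sentences, sg=1, hs=0, negative=5, **common)
# Skip-gram + hierarchical softmax
Word2Vec(sentences, sg=1, hs=1, negative=0, **common)
# CBOW + negative sampling
Word2Vec(sentences, sg=0, hs=0, negative=5, **common)
# CBOW + hierarchical softmax
Word2Vec(sentences, sg=0, hs=1, negative=0, **common)
```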

Question: Should we consider a word's position, i.e. put more weight on context words when they are near than when they are far apart?
Answer: It depends on the situation. If we are trying to capture a syntactic representation in the word vectors, then weighting context words by proximity gives a better representation. However, what has been found is that semantic representations, i.e. word meanings, are better captured when the words' proximity is not taken into account.

Question: What is the underlying principle of converting words to vectors?
Answer: Learning word vectors is an unsupervised task. You might have already gone through why the supervised way of dealing with text was a bad idea in my earlier blog post.
See NLP with Supervised Approach
The unsupervised way of handling NLP is what makes it powerful in terms of scalability, applicability across any language, and adaptability to a changing language corpus as new words get added. However, to convert words into vectors, we need some way of learning the vector representation.
CBOW and Skip-gram are two different architectures that try to do two different jobs. CBOW tries to predict a word given a continuous bag of context words, while the Skip-gram model tries to predict the context given a word.
However, the nice side effect of these two models is that, in trying to achieve their primary objective, they place the word vectors in the correct vector space, capturing semantic and syntactic regularities.

Question: What are the different ways we can do Word2Vec?
Answer: In Word2Vec there are two architectures. While these two architectures try to do two different tasks, the nice side effect of pursuing their primary objectives is that they convert the words into vectors in the right way (a small sketch of the training pairs follows after this list).
1. Continuous Bag of Words (CBOW): Given a continuous bag of context words, e.g. "the cat sits on the ___", we predict the target word, i.e. "mat".
i.e. Input = {W(i-2), W(i-1), W(i+1), W(i+2)}, Output = W(i)
2. Skip-gram model: The Skip-gram model does the inverse of CBOW, i.e. given the word "mat", we predict the context words around it, such as "cat" and "sits".
i.e. Input = W(i), Output = {W(i-2), W(i-1), W(i+1), W(i+2)}
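To make the input/output difference concrete, here is a minimal sketch (toy sentence and window of 2, both my own choices) of the training pairs each architecture would see:

```python
sentence = ["the", "cat", "sits", "on", "the", "mat"]
window = 2

for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    # CBOW:      predict the target from the whole context
    print("CBOW     :", context, "->", target)
    # Skip-gram: predict each context word from the target
    for c in context:
        print("Skip-gram:", target, "->", c)
```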

Question: Can you explain how CBOW and Skip-gram differ, as if I were a 5-year-old?
Answer:
CBOW is        “The cat ate ____.”
Skip-gram is “__  ___ ___ food”
