Dense Word Vector Representation

Question : Why use a dense vector representation instead of the sparse word co-occurrence matrix?
Answer: Dense word vectors have the following advantages over sparse co-occurrence word vectors:

  • Scalable model : Adding a new word is easy
    • Does not increase the training data size exponentially
    • Low data size footprint
  • Generic model :
    • Truncating dimensions while building the dense vectors reduces the specificity / over-fitting that was possible earlier.

Dimensionality reduction of the sparse word co-occurrence matrix not only addresses the “Curse of Dimensionality”, i.e. the need for more training examples and more parameters to tune, but also helps generalization. When car and automobile are represented as different words, it is harder to capture that the words close to each of them are related. For example:
Car has gear box.
Automobile has transmission box.
Since car and automobile are represented as different words in the sparse word co-occurrence matrix, it is hard to capture that the gear box and the transmission box mean the same thing.
However, if car and automobile were somehow represented by the same dimension, we could capture this more easily, and generalization would be easier.
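
The car / automobile example can be made concrete with a small sketch. It assumes sentence-level co-occurrence counts taken from the two sentences above, and a hypothetical `merge` matrix that maps car and automobile onto one shared "vehicle" dimension; neither is part of the original post, they only illustrate the argument.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Context dimensions: [car, automobile, has, box], counted from the two sentences.
gear = np.array([1.0, 0.0, 1.0, 1.0])          # "Car has gear box."
transmission = np.array([0.0, 1.0, 1.0, 1.0])  # "Automobile has transmission box."
sparse_sim = cosine(gear, transmission)

# Hypothetical merge: car and automobile share one dimension -> [vehicle, has, box].
merge = np.array([[1.0, 0.0, 0.0],   # car        -> vehicle
                  [1.0, 0.0, 0.0],   # automobile -> vehicle
                  [0.0, 1.0, 0.0],   # has
                  [0.0, 0.0, 1.0]])  # box
merged_sim = cosine(gear @ merge, transmission @ merge)

print(round(sparse_sim, 3), round(merged_sim, 3))
```

With separate dimensions, gear and transmission only overlap on "has" and "box"; once car and automobile share a dimension, their context vectors become identical.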

Question : OK, I agree with dense vectors, but how can I create the dense word vectors?
Answer: There are three different ways to create short dense word vectors. The techniques range from simply transforming the sparse vectors into denser vectors, as in SVD, to using predictive models such as neural network based models, in which we try to minimize the loss between the target word and the context words.

  • Count Based : SVD – Singular Value Decomposition
  • Prediction Based : Neural  Network Language model – skip-gram and CBOW
  • Brown Clustering

To know more, see Count Based vs Prediction Based Word Vector


Count Based Dense word vectors – Singular Value Decomposition (SVD)

Question : Why does SVD work better?
Answer : SVD is a dimensionality reduction technique. It stores most of the important information in a fixed, small number of dimensions.
SVD reduces the data dimension by rotating the dimension axes into a new space, keeping only the most relevant new dimensions and rejecting the lower-order ones. It works well because:

  • The removal performs denoising, i.e. low-order dimensions, which may represent unimportant information, are removed.
  • The removal helps the model generalize better.
  • Fewer dimensions mitigate the Curse of Dimensionality problem and make it easier to tune the classifier parameters.

Question : We earlier talked about storing most of the important information in a fixed, small number of dimensions. How do we determine how many dimensions are important enough?
Answer: It depends on the domain and the data itself. However, for word vectors, often 25-1000 dimensions are found to be important and are preserved with SVD.
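
One common heuristic for picking the number of dimensions, offered here as an assumption rather than as the method used in this post, is to keep the smallest k whose squared singular values account for a chosen share of the total. The helper name `smallest_k` and the 0.9 threshold are hypothetical.

```python
import numpy as np

def smallest_k(singular_values, energy=0.9):
    """Smallest k whose top-k squared singular values reach the given share of the total."""
    sq = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(sq) / sq.sum()
    # First position where the cumulative share reaches `energy`, capped at the full rank.
    return int(min(np.searchsorted(cumulative, energy) + 1, sq.size))

# Illustrative singular values, sorted in descending order as SVD returns them.
values = [10.0, 5.0, 1.0, 0.5]
print(smallest_k(values, energy=0.9))
```

The threshold is still a domain-dependent choice, so in practice k is usually validated on the downstream task.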

Question: How does SVD work?
Answer: Singular Value Decomposition : SVD allows us to factorise a matrix into three matrices, U, S and V,
[Figure: SVD - 1 matrix to three matrix]
where the columns of U are the left singular vectors,
the columns of V are the right singular vectors,
and S is a diagonal matrix holding the singular values.

SVD generates a lower-dimensional dataset from a high-dimensional one by rotating the dimension axes into a new space, such that the first new dimension captures the most variance, the second captures the next most, and so on. We decompose the matrix X into U, S and V, where U and V have orthogonal columns, and instead of keeping all the columns we keep only the top k, which give the best rank-k approximation in the least squares sense.
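
The decomposition and the top-k truncation described above can be sketched with NumPy's `np.linalg.svd`. The matrix `X` and the choice `k = 2` are made-up illustrations, not data from this post.

```python
import numpy as np

# A small example matrix (illustrative values only).
X = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0],
              [1.0, 1.0, 3.0],
              [0.0, 2.0, 1.0]])

# Thin SVD: X = U @ diag(s) @ Vt, with singular values s in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_full = U @ np.diag(s) @ Vt

# Keep only the top k singular values and their singular vectors.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius-norm error of the rank-k approximation.
err = np.linalg.norm(X - X_k)
print(s, err)
```

The error of the truncated reconstruction comes entirely from the discarded singular values, which is why dropping the smallest ones loses the least information.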


Question : Can you show SVD in practice with an example?
Answer: Yes, we can see SVD in action with an image example.

[Figures: SVD1- image example, SVD- image example 2]

Left Fig : Top: original image. Bottom: image reconstructed with only the 1st SVD dimension.
Right Fig : SVD with dim 2, 20 and 50 respectively.


Question : Can we simulate the SVD sparse-to-dense vector conversion, so that we can understand the concepts better?
Answer : Sure, let’s consider a window based word co-occurrence matrix from a document corpus,
[Figure: window based cooccurence matrix]

Relating this to our word vectorisation example, let’s decompose the document into a word co-occurrence matrix as in the figure. The word co-occurrence matrix can be converted into three matrices U, S and V.
In our case, the U matrix captures the word (on the x-axis) to concept relationship.
The V matrix captures the word (on the y-axis) to concept relationship.
And the S matrix captures the strength of the relationship.
Since in our case the words in U and V are the same, we can consider only one of U or V, as both capture the same word to concept relationship.
However, if the axes were different, both would have been relevant. For example, for a matrix showing how much each person likes each movie, U would likely capture the person to movie concept relationship, such as people-genre, V would capture the movie to concept relationship, such as movie-genre, and S would capture how strongly people like each genre.

In our word case, we can see SVD in action by taking the top two orthonormal columns for the earlier document shown in Fig 1. For code, see Word vector with SVD

[Figure 2: NLP word vectors with SVD]

To conclude, what we are ultimately doing with SVD is mapping each term to the term concepts.
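
The simulation above can be sketched in a few lines of NumPy. The toy corpus, the window size of 1 and the choice of two dimensions are assumptions made for illustration; they are not the document or the code referenced in this post.

```python
import numpy as np

# Hypothetical toy corpus; each string is one sentence.
corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]
window = 1  # count neighbours one position to the left and right

tokens = [sentence.split() for sentence in corpus]
vocab = sorted({word for sentence in tokens for word in sentence})
index = {word: i for i, word in enumerate(vocab)}

# Window based word co-occurrence matrix (words on both axes).
X = np.zeros((len(vocab), len(vocab)))
for sentence in tokens:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                X[index[word], index[sentence[j]]] += 1

# Decompose and keep the top two orthonormal columns of U as 2-d word vectors.
U, s, Vt = np.linalg.svd(X)
k = 2
word_vectors = U[:, :k]

for word in vocab:
    print(word, word_vectors[index[word]])
```

Each row of `word_vectors` is a dense 2-dimensional vector for one word, which is what a plot like Figure 2 would show. The signs of the columns are not unique, so the picture may appear mirrored between runs or libraries.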

Question: What do the U, V and S matrices in the SVD represent in the case of word vectors?
Answer: In the case of word vectors, SVD transforms the |V| x c term-document matrix (|V| here being the vocabulary size) into the U word vector matrix and the V word-context matrix.

Question : SVD nicely transforms the sparse vectors into dense vectors, but is there any disadvantage to using it?
Answer: Well, there are a couple of disadvantages, as listed below:

  • Primarily scalability : Adding new words is not supported. Since SVD is computed from a co-occurrence matrix, if a new term or new document is added, the word co-occurrence matrix has to be updated and the SVD computed again. This makes it an unscalable model.
  • Slow : SVD is slow compared to other techniques such as the neural network based dense word vectorisation approach.

Question: Can we illustrate SVD pictorially?
Answer: Dense word vectorisation with SVD can be illustrated as,

[Figure: SVD - Word Vectors]
Figure : Using SVD to create dense word vectors

Prediction Based Dense word vectors  : Neural Network Based

Question : Why neural network based word vectorisation?
Answer: Neural network based word vectorisation is faster, easier to train and much more scalable, hence we prefer NN-based word vectorisation.

Question: How does Neural Network based word vectorisation work?
Answer : We use different neural network architectures, each of which tries to solve a different task, such as

  • CBOW model: It predicts a word given context words.
  • Skip-gram model :  It predicts context words given a word.
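
The difference between the two tasks shows up in how training examples are built from a sentence. This is a minimal sketch; the function names `cbow_pairs` and `skipgram_pairs` and the window size are hypothetical, and no network is trained here.

```python
def cbow_pairs(tokens, window=1):
    """CBOW examples: (context words, target word), predict the word from its context."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=1):
    """Skip-gram examples: (word, context word), predict each context word from the word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "car has gear box".split()
print(cbow_pairs(sentence))
print(skipgram_pairs(sentence))
```

CBOW produces one example per word with all of its context words as input, while skip-gram produces one example per (word, context word) pair.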

As a nice side effect of solving the main problem, we observe that the NN-based approaches do an excellent job of learning word vectors very efficiently.

Question : What is the intuition behind the neural network based dense word vectorisation?
Answer: The intuition is that words with similar meaning often occur near each other. With that understanding, we first initialize each word’s embedding randomly. As training goes on, we shift a word’s embedding vector to be more like the embeddings of its neighboring words and less like the embeddings of words that don’t occur nearby.
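
This shifting can be sketched as a single gradient step on a logistic loss over the dot product of two embeddings, in the style of skip-gram with negative sampling. That particular loss is an assumption chosen for illustration, since the post does not name one; `sgns_step`, the learning rate and the embedding size are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, label, lr=0.1):
    """One gradient step for a (word, context) pair.
    label=1 for a real neighbour, label=0 for a word that does not occur nearby."""
    grad = sigmoid(center @ context) - label  # derivative of the loss w.r.t. the dot product
    new_center = center - lr * grad * context
    new_context = context - lr * grad * center
    return new_center, new_context

rng = np.random.default_rng(0)
dim = 8
word = rng.normal(scale=0.1, size=dim)       # randomly initialised embeddings
neighbour = rng.normal(scale=0.1, size=dim)
stranger = rng.normal(scale=0.1, size=dim)

# A neighbouring pair is pulled together, a non-neighbouring pair is pushed apart.
pos_word, pos_neighbour = sgns_step(word, neighbour, label=1)
neg_word, neg_stranger = sgns_step(word, stranger, label=0)

print(word @ neighbour, pos_word @ pos_neighbour)
print(word @ stranger, neg_word @ neg_stranger)
```

Repeating such steps over many (word, context) pairs is what gradually moves words that share neighbours towards each other in the embedding space.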

Question : How does Continuous Bag of Words work?
Answer : To know more about the Continuous Bag of Words vectorisation approach, please refer to the other post, “Continuous Bag of Words”.

Question : How does the skip-gram model work?
Answer : To know more about the skip-gram model, please refer to the other post, “Skip-gram Word Vectorisation”.