Dense Word Vector Representation
Question : Why use a dense vector representation instead of the sparse word co-occurrence matrix?
Answer: The advantages of dense word vectors over sparse co-occurrence word vectors are,
- Scalable model : Adding a new word is easy
- Does not increase the training data size exponentially
- Low data size footprint
- Generic model : Dimension truncation during the dense vector making process reduces the specificity / over-fitting that was possible earlier.
Dimensionality reduction of the sparse word co-occurrence matrix not only addresses the “Curse of Dimensionality”, i.e. the need for more training examples and more parameters to tune, but also helps generalization. For example, when car and automobile are represented as different words, it is harder to capture the relationship between the words that are independently close to each of them. Consider:
Car has gear box.
Automobile has transmission box.
Since car and automobile are represented as different words in the sparse word co-occurrence matrix, it is hard to capture that the gear box and the transmission box mean the same thing.
However, if car and automobile were somehow represented by the same dimension, we could capture this more easily, and generalization would be easier.
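The car / automobile example can be checked with a small sketch. It assumes sentence-level co-occurrence counts and cosine similarity, and it merges "automobile" into "car" by hand as a stand-in for what a shared dense dimension would do; the helper names `cooccurrence_vectors` and `cosine` are made up for this illustration.

```python
from collections import Counter
from itertools import permutations
import math

sentences = [
    ["car", "has", "gear", "box"],
    ["automobile", "has", "transmission", "box"],
]

def cooccurrence_vectors(sentences, merge=None):
    """Count sentence-level co-occurrences; `merge` maps a word onto another word."""
    merge = merge or {}
    vectors = {}
    for sentence in sentences:
        tokens = [merge.get(word, word) for word in sentence]
        for word, context in permutations(tokens, 2):
            vectors.setdefault(word, Counter())[context] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

separate = cooccurrence_vectors(sentences)
# Assumption: pretend car and automobile share one dimension.
merged = cooccurrence_vectors(sentences, merge={"automobile": "car"})

sim_separate = cosine(separate["gear"], separate["transmission"])
sim_merged = cosine(merged["gear"], merged["transmission"])
print(round(sim_separate, 3), round(sim_merged, 3))  # prints 0.667 1.0
```

With separate dimensions, "gear" and "transmission" only overlap on "has" and "box"; once car and automobile share a dimension, their context vectors become identical.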
Question : Ok, I agree that dense vectors are useful, but how can I create dense word vectors?
Answer: There are three different ways to create short dense word vectors. The techniques range from simply transforming the sparse vectors into denser vectors, as in SVD, to using predictive models such as neural-network-based models, in which we try to minimize the loss between the target word and the context words.
- Count Based : SVD – Singular Value Decomposition
- Prediction Based : Neural Network Language model – skip-gram and CBOW
- Brown Clustering
To know more, see “Count Based vs Prediction Based Word Vector”.
Count Based Dense word vectors – Singular Value Decomposition (SVD)
Question : Why does SVD work well?
Answer : SVD is a dimensionality reduction technique. It stores most of the important information in a fixed, small number of dimensions.
SVD reduces the data dimension by rotating the axes into a new space, keeping only the most relevant new dimensions and rejecting the lower-order ones. It works well because
- The removal performs denoising, i.e. low-order dimensions, which may represent unimportant information, are removed.
- The removal helps the model generalize better.
- Fewer dimensions ease the Curse of Dimensionality, which makes it easier to tune the classifier parameters.
Question : We earlier talked about storing most of the important information in a fixed, small number of dimensions. How do we determine how many dimensions are important enough?
Answer: It depends on the domain and the data itself. For word vectors, however, somewhere between 25 and 1000 dimensions are often found to be important and are preserved with SVD.
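One possible way to pick the number of dimensions from the data is to look at how quickly the singular values fall off. The sketch below is only one heuristic, not a rule from this post: it assumes NumPy, a made-up helper `choose_rank`, an arbitrary 90% threshold, and synthetic data.

```python
import numpy as np

def choose_rank(X, energy=0.90):
    """Smallest k whose top-k squared singular values reach `energy` of the total."""
    s = np.linalg.svd(X, compute_uv=False)       # singular values, largest first
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(s))

rng = np.random.default_rng(0)
# Hypothetical data: a rank-3 signal plus a little noise.
signal = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 40))
X = signal + 0.01 * rng.normal(size=(50, 40))

k = choose_rank(X, energy=0.90)
print(k)
```

In practice the threshold itself is a tuning choice, and for word vectors the dimension is usually validated on a downstream task.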
Question: How does SVD work ?
Answer: Singular Value Decomposition : SVD allows us to factorise a matrix X as a product of 3 matrices, X = U S V^T,
where U holds the left singular vectors,
V holds the right singular vectors,
and S is the diagonal matrix of singular values.
SVD generates a k-dimensional dataset from a high-dimensional one by rotating the axes into a new space, such that the first new dimension captures the most variance, the second captures the next most, and so on. We decompose X into U, S and V, where U and V have orthonormal columns and S is diagonal. Instead of keeping all the columns, we keep only the top k, which gives the best rank-k approximation of X in the least-squares sense.
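As a minimal sketch of the factorisation, assuming NumPy and a small made-up matrix X, we can compute the three factors, multiply them back, and keep only the top k components:

```python
import numpy as np

X = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# full_matrices=False gives the compact form: U is 2x2, s has 2 values, Vt is 2x3.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The three factors multiply back to the original matrix.
X_rebuilt = U @ np.diag(s) @ Vt

# Keep only the top k singular values and their singular vectors.
k = 1
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(s)                         # singular values, largest first
print(np.linalg.norm(X - X_k))   # Frobenius error of the rank-k approximation
```

Note that NumPy returns the singular values as a 1-D array and returns V already transposed. The error of the rank-1 approximation here equals the singular value that was dropped.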
Question : Can you show SVD in practice with an example?
Answer: Yes, we can see SVD in action with an image example
Left Fig : Top – original Image . Bottom: Image with SVD – 1st dimension
Right Fig : SVD with dim 2, 20 and 50 respectively.
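A similar experiment can be sketched in code. Since the original picture is not available here, the sketch assumes NumPy and a synthetic 64x64 grayscale array as a stand-in image; `rank_k_approximation` is a name invented for this example. It rebuilds the image with 1, 2, 20 and 50 dimensions and reports the relative error.

```python
import numpy as np

def rank_k_approximation(image, k):
    """Rebuild `image` from its top-k singular values and vectors."""
    U, s, Vt = np.linalg.svd(image, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Synthetic stand-in image (an assumption; any 2-D array of pixel values works).
rng = np.random.default_rng(0)
rows, cols = np.mgrid[0:64, 0:64]
image = np.sin(rows / 6.0) + np.cos(cols / 9.0) + 0.1 * rng.normal(size=(64, 64))

errors = {}
for k in (1, 2, 20, 50):
    approx = rank_k_approximation(image, k)
    # Relative Frobenius error: 0 means a perfect reconstruction.
    errors[k] = np.linalg.norm(image - approx) / np.linalg.norm(image)
    print(k, round(errors[k], 4))
```

The error shrinks as more dimensions are kept, which matches what the figures show visually.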
Question : Can we simulate the SVD sparse-to-dense vector conversion, so that we can understand the concepts better?
Answer : Sure, let’s consider a word co-occurrence matrix built from a document corpus.
Correlating it to our word vectorisation example, let’s decompose the document into a word co-occurrence matrix as in the figure on the side. The word co-occurrence matrix can be factorised into 3 matrices U, S and V.
In our case, the matrix U captures the word-to-concept relationship for the words on the x-axis.
The matrix V captures the word-to-concept relationship for the words on the y-axis.
And the matrix S captures the strength of each concept.
Since in our case the words on both axes are the same, we can consider only one of U or V, since both of them carry the same word-to-concept information.
However, if the axes were different, then both of them would have been relevant. For example, for a matrix showing how much each person likes each movie, U would likely have captured the person-to-concept relationship such as people-genre, V would have captured the movie-to-concept relationship such as movie-genre, and S would have captured how strong each genre is.
For our word case, see SVD in action by taking the top two orthonormal columns for the earlier document, as shown in Fig 1. For code, see “Word vector with SVD”.
To conclude, what we are ultimately doing with SVD is mapping each term to the term concepts.
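The whole sparse-to-dense conversion can be sketched in a few lines. This is not the code from the linked post; it assumes NumPy, a hypothetical three-sentence toy corpus, and a co-occurrence window of one word on each side.

```python
import numpy as np

# Hypothetical toy corpus, used only for illustration.
corpus = [
    "i like deep learning",
    "i like nlp",
    "i enjoy flying",
]
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({word for sentence in tokens for word in sentence})
index = {word: i for i, word in enumerate(vocab)}

# Sparse word co-occurrence matrix: count neighbours one position left or right.
X = np.zeros((len(vocab), len(vocab)))
for sentence in tokens:
    for i, word in enumerate(sentence):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sentence):
                X[index[word], index[sentence[j]]] += 1

# Factorise and keep the top two columns of U as 2-dimensional word vectors.
U, s, Vt = np.linalg.svd(X)
k = 2
word_vectors = U[:, :k]

for word in vocab:
    print(word, word_vectors[index[word]].round(3))
```

Each row of `word_vectors` is now a dense 2-dimensional vector for one word, in place of its full row of co-occurrence counts.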
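Question: What do the U, V and S matrices in the SVD represent in the case of word vectors ?
Answer: In the case of word vectors, SVD factorises the |V| x c term-document matrix (where |V| is the vocabulary size and c is the number of contexts) into the word vector matrix U, the singular values S that give the strength of each concept, and the word-context matrix V.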
Question : SVD nicely transforms the sparse vectors into dense vectors, but is there any disadvantage to using it?
Answer: Well, there are a couple of disadvantages, as listed below
- Primarily Scalability : Addition of new words is not supported. Since SVD is computed from a co-occurrence matrix, if a new term or new document is added, the word co-occurrence matrix has to be updated and the SVD has to be computed again. This makes it an unscalable model.
- Slow : SVD is comparatively slower than other techniques such as the Neural Network based dense word vectorisation approach.
Question: Can we illustrate SVD pictorially?
Answer: Dense word vectorisation with SVD can be illustrated pictorially as,
Prediction Based Dense word vectors : Neural Network Based
Question : Why Neural Network based word vectorisation?
Answer: Neural Network based word vectorisation is faster, easier to train and much more scalable, hence we prefer NN-based word vectorisation.
Question: How does Neural Network based word vectorisation work?
Answer : We use different neural network architectures, which try to solve different tasks, such as
- CBOW model : It predicts a word given its context words.
- Skip-gram model : It predicts the context words given a word.
As a nice side effect of solving the main problem, we observe that the NN-based approaches do an excellent job of learning word vectors very efficiently.
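The difference between the two tasks shows up clearly in the training examples each one sees. The sketch below uses plain Python and only builds the (input, output) pairs, not the networks; the function names and the window size are assumptions made for this illustration.

```python
def context_of(tokens, i, window):
    """Words within `window` positions of position i, excluding position i itself."""
    start, stop = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(start, stop) if j != i]

def skipgram_pairs(tokens, window=2):
    """Skip-gram: input is the word, output is one context word at a time."""
    return [(word, context)
            for i, word in enumerate(tokens)
            for context in context_of(tokens, i, window)]

def cbow_pairs(tokens, window=2):
    """CBOW: input is the list of context words, output is the word itself."""
    return [(context_of(tokens, i, window), word)
            for i, word in enumerate(tokens)]

tokens = "the quick brown fox".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

For the word "quick" with a window of 1, skip-gram produces two examples, ("quick", "the") and ("quick", "brown"), while CBOW produces a single example, (["the", "brown"], "quick").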
Question : What is the intuition behind the neural network based dense word vectorisation?
Answer: The intuition is that words with similar meaning often occur near each other. With that understanding, we initially assign each word a random embedding vector. Then, as we go on training our model, we shift a word’s embedding vector to be more like the embeddings of its neighboring words and less like the embeddings of the words that don’t occur nearby.
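This shifting can be made concrete with a single training step. The sketch below assumes a skip-gram style update with one negative sample, which is one common way to train such models and is not spelled out in this post; it also assumes NumPy, hand-picked 2-dimensional vectors, and an invented function name `sgns_step`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, positive, negative, lr=0.1):
    """One gradient step on -log s(positive.center) - log s(-negative.center)."""
    pos_score = sigmoid(positive @ center)
    neg_score = sigmoid(negative @ center)
    # Gradients are computed from the old vectors before anything is updated.
    grad_center = (pos_score - 1.0) * positive + neg_score * negative
    grad_positive = (pos_score - 1.0) * center
    grad_negative = neg_score * center
    return (center - lr * grad_center,
            positive - lr * grad_positive,
            negative - lr * grad_negative)

center = np.array([0.5, 0.0])      # word being trained
neighbour = np.array([0.0, 0.5])   # word seen nearby
stranger = np.array([0.5, 0.5])    # word not seen nearby

new_center, new_neighbour, new_stranger = sgns_step(center, neighbour, stranger)
print(neighbour @ center, "->", new_neighbour @ new_center)
print(stranger @ center, "->", new_stranger @ new_center)
```

After the step, the dot product with the neighbouring word goes up and the dot product with the word that was not nearby goes down, which is the shift described above.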
Question : How does Continuous Bag of Words work?
Answer : To know more about the Continuous Bag of Words vectorisation approach, please refer to the other post, “Continuous Bag of Words”.
Question : How does the Skip-gram model work?
Answer : To know more about the skip-gram model, please refer to the other post, “Skip-gram Word Vectorisation”.