Question: How can we capture word similarity when converting words to vectors?
Answer: We can capture word similarity using four kinds of vector models, grouped into sparse and dense representations:

• Sparse Word Vector Representation
  • Co-occurrence Matrix based Sparse Vector Representation
• Dense Word Vector Representation
  • SVD (Singular Value Decomposition) based representation
  • Neural Network based models, i.e. skip-gram, CBOW
  • Brown clusters

### Co-occurrence Matrix based Sparse Vector Representation

In this post, we will focus on the Sparse Word Vector Representation based on the co-occurrence matrix.

Question : How is a document converted into a co-occurrence matrix?
Answer : Let’s say we have three documents:
1. “John like movies. Mary like Movies too.”
2. “John like football too”
3. “Bikal like Machine learning too”

Based on these three documents, a term list is created (ignoring case and punctuation) as
[John, like, movies, Mary, too, football, Machine, Learning, Bikal]

Now, each document is converted into a row of term counts over this list, giving the matrix
[John, like, movies, Mary, too, football, Machine, Learning, Bikal]
Document 1 : [1, 2, 2, 1, 1, 0, 0, 0, 0]
Document 2 : [1, 1, 0, 0, 1, 1, 0, 0, 0]
Document 3 : [0, 1, 0, 0, 1, 0, 1, 1, 1]
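
The counting step can be sketched in a few lines of Python. This is only a minimal illustration: the `tokenize` helper is a hypothetical name, and lowercasing plus stripping punctuation is an assumed preprocessing choice, so that "Movies" and "movies." count as the same term.

```python
import re
from collections import Counter

documents = [
    "John like movies. Mary like Movies too.",
    "John like football too",
    "Bikal like Machine learning too",
]

# Term list in the same column order as the list above (lowercased).
vocabulary = ["john", "like", "movies", "mary", "too",
              "football", "machine", "learning", "bikal"]

def tokenize(text):
    # Assumed preprocessing: lowercase and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

# One row per document, one column per term in the vocabulary.
matrix = []
for doc in documents:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

for number, row in enumerate(matrix, start=1):
    print(f"Document {number} : {row}")
```

Each printed row is the count vector of one document over the shared term list.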

When we create the term frequency matrix as in the picture below, we can easily deduce which items are similar: the two plays Julius Caesar and Henry V are similar because they share the contextual words soldier and battle.

Question : Well, this looks fine for a 4 * 4 matrix, but in reality we have millions and millions of words. If we follow the same approach, our real matrix will be a very sparse matrix of size million * million. How can we solve this problem?
Answer: There are many algorithms for dealing with sparse matrices, covering everything from storage to processing.
Some of the sparse matrix storage formats are

• Coordinate Storage
• Compressed Sparse Row
• Compressed Sparse Column
• Block Sparse Row
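
To give a feel for the first two formats, here is a rough pure-Python sketch. The function names `to_coo`, `to_csr` and `csr_row` are made up for this illustration; real projects would normally rely on a sparse matrix library instead.

```python
dense = [
    [0, 0, 3, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 2, 0, 4],
]

def to_coo(matrix):
    # Coordinate storage: one (row, column, value) triple per non-zero cell.
    return [(r, c, value)
            for r, row in enumerate(matrix)
            for c, value in enumerate(row)
            if value != 0]

def to_csr(matrix):
    # Compressed Sparse Row: non-zero values, their column indices, and
    # row pointers marking where each row starts inside `data`.
    data, indices, indptr = [], [], [0]
    for row in matrix:
        for c, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(c)
        indptr.append(len(data))
    return data, indices, indptr

def csr_row(data, indices, indptr, r):
    # Recover the non-zero cells of row r as {column: value}.
    start, end = indptr[r], indptr[r + 1]
    return dict(zip(indices[start:end], data[start:end]))

coo = to_coo(dense)
data, indices, indptr = to_csr(dense)
print(coo)
print(data, indices, indptr)
print(csr_row(data, indices, indptr, 3))
```

Both formats store only the non-zero cells. Coordinate storage is the simplest to build, while Compressed Sparse Row makes it cheap to read a whole row, which is what a word vector lookup needs.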

Some of the compression codes used for sparse matrices in information retrieval are

• Interpolative Huffman
• Golomb-Gamma
• Byte-Aligned-Byte
• Gamma – Gamma

However, we will not discuss sparse matrix handling in this post.

Question : What window size should I consider?
Answer : It depends on what the vectors should capture.
1. A shorter window (1-3) is preferred for a more syntactic representation.
2. A longer window (4-10) is preferred for a more semantic representation.
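
The effect of the window size can be seen in a small sketch of word-word co-occurrence counting. The function name `cooccurrence_counts` and the symmetric window (the same number of words on each side) are assumptions made for this example.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window):
    # Count how often each ordered pair (word, neighbour) occurs with the
    # neighbour at most `window` positions away from the word.
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

tokens = "john like movies mary like movies too".split()
narrow = cooccurrence_counts(tokens, window=1)
wide = cooccurrence_counts(tokens, window=4)

# With window=1, "john" only sees its direct neighbour "like".
print(narrow[("john", "like")], narrow[("john", "mary")])
# With window=4, "john" also co-occurs with the more distant "mary".
print(wide[("john", "like")], wide[("john", "mary")])
```

A narrow window mostly records immediate neighbours, while a wide window also picks up words that merely appear in the same broader context.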

Question : But I already see a lot of problems in the earlier example. One of them is that Document 1 contains “like” twice. Does that mean the document is more relevant to “like”?
Answer : No, the document is not more relevant to “like”. Certain words, such as a, an, and the, appear too often to be informative. These need to be dealt with, and one of the most popular ways is to weight a term’s frequency by the inverse of its document frequency, i.e.
(term frequency in the document) * log((total number of documents) / (number of documents containing the term)).

To know more, please see “How does Term Frequency – Inverse Document Frequency (tf-idf) work”.
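
As a rough sketch, the weighting can be tried on the three example documents. The code assumes the common tf * log(N / df) variant (several other variants exist), and the function name `tf_idf` is chosen just for this illustration.

```python
import math

documents = [
    ["john", "like", "movies", "mary", "like", "movies", "too"],
    ["john", "like", "football", "too"],
    ["bikal", "like", "machine", "learning", "too"],
]

def tf_idf(term, doc, docs):
    # Term frequency: raw count of the term in this document.
    tf = doc.count(term)
    # Document frequency: number of documents that contain the term.
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    # Assumed variant: tf * log(N / df).
    return tf * math.log(len(docs) / df)

for term in ("like", "john", "movies"):
    print(term, round(tf_idf(term, documents[0], documents), 3))
```

Because “like” occurs in every document, its inverse document frequency is log(3 / 3) = 0, so its weight in Document 1 drops to zero even though it appears there twice.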

Question : Well, now that we have represented words as vectors, how do we determine their similarity?
Answer: The simplest way of computing the similarity between two vectors is the dot product.

Question : But what about the dot product’s bias towards vector length, i.e. word frequency?
Answer : A vector is longer if it has higher values in each dimension, so more frequent words have longer vectors and therefore larger dot products. It is a bad idea to use a similarity metric that is sensitive to word frequency, hence we use cosine similarity.

Question : What is cosine similarity?
Answer: We can normalize the dot product by dividing it by the lengths of the two vectors: cosine(v, w) = (v · w) / (|v| |w|). This turns out to be the cosine of the angle between them.
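
The difference between the two measures can be sketched with made-up count vectors. The vectors `rare`, `frequent` and `other` below are hypothetical; `frequent` points in the same direction as `rare` but has ten times the counts.

```python
import math

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def cosine(v, w):
    # Dot product normalized by the lengths of both vectors.
    norm_v = math.sqrt(dot(v, v))
    norm_w = math.sqrt(dot(w, w))
    if norm_v == 0 or norm_w == 0:
        return 0.0
    return dot(v, w) / (norm_v * norm_w)

rare = [1, 2, 0]        # hypothetical counts for an infrequent word
frequent = [10, 20, 0]  # same direction, ten times the counts
other = [0, 1, 3]       # a third word to compare against

# The dot product grows with the raw counts.
print(dot(rare, other), dot(frequent, other))
# The cosine depends only on the direction of the vectors.
print(round(cosine(rare, other), 4), round(cosine(frequent, other), 4))
```

The dot product with `other` is ten times larger for the frequent word, while the cosine is the same for both.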

Question : Can we see cosine similarity with an example?
Answer : In the picture below, from the three word vector example, we can easily see that although data information has a high word co-occurrence frequency and hence a longer vector, it does not add bias to the word similarity when computed using cosine similarity.

Question : OK. But what about synonyms being represented as separate dimensions in the above case? For example, good, fine, high quality, high standard, etc. are represented as distinct words, hence distinct dimensions. Does the curse of dimensionality not apply here?
Answer: Well, although the relatedness of such words is captured by the co-occurrence matrix, their representation as distinct individual words creates the dimensionality problem. We can solve it by reducing the dimensions using various techniques, such as SVD, ultimately translating the sparse vectors into dense vectors.

Question : Are there any drawbacks to the co-occurrence based word vector approach other than the high number of dimensions?
Answer: There are several problems with the co-occurrence based word vector approach:

• Unscalable model: With every added word, the co-occurrence matrix grows quadratically, e.g.
  1 word – Co-occurrence matrix size : 1 * 1
  2 words – Co-occurrence matrix size : 2 * 2
  3 words – Co-occurrence matrix size : 3 * 3
• Dimensionality problem
  • Huge space needed
  • Curse of dimensionality: more dimensions make it hard to tune parameters optimally
• Less robust model
  • Synonyms as different discrete words: the model is not generic enough, because every synonym gets its own dimension, i.e. good, best, and fine are represented as three separate words rather than as something similar to “good”.
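
To put numbers on the space problem: a word-word matrix over V words has V * V cells. The sketch below assumes a dense layout with 4 bytes per cell, which is only an illustrative figure.

```python
BYTES_PER_CELL = 4  # assumed: one 32-bit count per cell

def dense_cells(vocab_size):
    # A word-word co-occurrence matrix has one cell per pair of words.
    return vocab_size * vocab_size

def dense_bytes(vocab_size):
    return dense_cells(vocab_size) * BYTES_PER_CELL

for vocab_size in (1, 2, 3, 1_000_000):
    print(vocab_size, "words:", dense_cells(vocab_size), "cells,",
          dense_bytes(vocab_size), "bytes")
```

Under these assumptions, a vocabulary of one million words needs 10^12 cells, or about 4 terabytes if every cell is stored.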

Question : Why use a dense vector representation of the sparse word co-occurrence matrix?
Answer: Dimensionality reduction of the sparse word co-occurrence matrix not only addresses the “Curse of Dimensionality”, i.e. the need for more training examples and more parameters to tune, but also helps with generalization. For example, when car and automobile are represented as different words, it is harder to capture the words that are close to both of them, e.g.
Car has gear box.
Automobile has transmission box.
Since car and automobile are represented as different words in the sparse word co-occurrence matrix, it is hard to capture that the gear box and the transmission box mean the same thing.
However, if car and automobile were somehow represented by the same dimension, we could capture this more easily, and generalization would be easier.
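
A toy sketch with NumPy’s `numpy.linalg.svd` hints at how this can work. The counts below are invented for illustration: car and automobile have different "box" contexts but share a hypothetical "engine" context, and banana is added as an unrelated word.

```python
import numpy as np

words = ["car", "automobile", "banana"]
contexts = ["gear", "transmission", "engine", "fruit"]

# Hypothetical co-occurrence counts (rows: words, columns: contexts).
counts = np.array([
    [2.0, 0.0, 1.0, 0.0],   # car
    [0.0, 2.0, 1.0, 0.0],   # automobile
    [0.0, 0.0, 0.0, 3.0],   # banana
])

def cosine(v, w):
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

# Truncated SVD: keep only the k largest singular values.
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
dense = U[:, :k] * S[:k]   # one k-dimensional vector per word

print("sparse car vs automobile:", round(cosine(counts[0], counts[1]), 3))
print("dense car vs automobile: ", round(cosine(dense[0], dense[1]), 3))
print("dense car vs banana:     ", round(cosine(dense[0], dense[2]), 3))
```

In the original space, car and automobile overlap only on the shared context, so their cosine is low. After keeping two dimensions they collapse onto the same direction, while banana stays apart.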

To know more, please see “Dense Word Vector Representation”.