Question: Why use document vectors / paragraph vectors when we already have the word2vec approach?
Answer: The word2vec approach considers a fixed window size, often 10 for skip-gram models and 5 for the Continuous Bag of Words (CBOW) model. This means that every word appearing across multiple windows has to be processed again and again. Counting word co-occurrences across the whole data set all at once, as document-vector approaches do, avoids this unnecessary repetition and inefficiency.

Question: How can we compute this document-to-vector representation?
Answer: There are two ways to convert a full document into a vector:
1. Windowed approach: similar to word2vec, we count co-occurrences in windows around each word.
2. Word-document co-occurrence matrix: we compute the co-occurrence matrix over the whole document.

Question: What is the difference between these two approaches, the windowed approach vs. the word-document co-occurrence matrix approach?
Answer: The two approaches capture two different aspects.
1. The windowed approach captures both semantic and syntactic information. It is often used in tasks such as machine translation.
2. The word-document co-occurrence matrix captures the general topic, e.g. boat, rudder, fins, captain, ship and crew all appear together in a boating topic.

Question: Can you explain with a simple example how to compute document vectors using a windowed approach?
Answer: Let's consider a small three-sentence corpus, with window length 1 and a symmetric window (word order is ignored).

The word co-occurrence matrix for this simple corpus would then be:

Fig 1.

Now, based on the first row, we can infer from the co-occurrence matrix that “like” and “enjoy” are somewhat similar, since both co-occur with “I”, while “like” and “deep” do not interact in this way. In effect, we have found the verbs.
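To make this concrete, here is a minimal sketch in Python of building such a window-1 co-occurrence matrix. The three-sentence corpus used here (“I like deep learning”, “I like NLP”, “I enjoy flying”) is an assumption, chosen only to match the words discussed above.

```python
import numpy as np

# Assumed toy corpus, chosen to match the words discussed above.
corpus = [
    "I like deep learning".split(),
    "I like NLP".split(),
    "I enjoy flying".split(),
]

# Build the vocabulary and an index for each word.
vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts with window length 1.
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    for i, word in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                X[index[word], index[sent[j]]] += 1

# "like" and "enjoy" both co-occur with "I", so their rows look similar.
print(vocab)
print(X)
```

Note that the matrix is symmetric by construction, since a window-1 neighbour relation goes both ways.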

Question: Nice, but what if my document is very large? Is this model scalable?
Answer: Unfortunately, the co-occurrence matrix becomes very large and very sparse as the vocabulary grows. For example, millions of words mean a sparse matrix with millions by millions of entries.

Question: How can we solve this?
Answer: We can re-represent the matrix using low-dimensional vectors, i.e. store most of the important information while discarding the less important information. We usually use 25–1000 dimensions.

Question: Nice idea, but how do we do it?
Answer: We can use a couple of techniques, such as:
1. SVD
2. Direct learning

### Singular Value Decomposition – SVD

Question: How does SVD work?
Answer: Singular Value Decomposition decomposes the matrix X into orthogonal factors U, S and V (X = USVᵀ). Instead of keeping all the orthogonal columns of U, S and V, we keep only the top k singular directions, which are the most important components in the least-squares sense.

To see SVD in action, take the top two orthonormal columns for the earlier corpus shown in Fig 1. For code, see Word vector with SVD.

Figure 2.
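As a sketch of the decomposition itself, the following Python snippet keeps only the top-k singular directions of a small co-occurrence matrix. The matrix values here are illustrative assumptions, not the exact counts behind Fig 1.

```python
import numpy as np

# A small illustrative, symmetric co-occurrence matrix
# (values are assumptions, not the counts behind Fig 1).
X = np.array([
    [0, 2, 1, 0],
    [2, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
], dtype=float)

# Full SVD: X = U @ diag(S) @ Vt, with U and Vt orthonormal
# and the singular values S sorted in decreasing order.
U, S, Vt = np.linalg.svd(X)

# Keep only the top-k singular directions; each row of
# U[:, :k] * S[:k] is then a k-dimensional word embedding.
k = 2
embeddings = U[:, :k] * S[:k]

# The rank-k reconstruction is the best rank-k approximation
# of X in the least-squares sense.
X_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print(np.round(X_k, 2))
```

Plotting the two columns of `embeddings` against each other gives exactly the kind of 2-D word map shown in Figure 2.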

Question: Well, now we have seen that SVD captures word vectors efficiently in the windowed approach. Can we improve them further? If yes, how?
Answer: There are a number of techniques to improve them further. A few of them are:
1. Large word frequency optimization: We can easily observe that some words co-occur too frequently and carry no significant information, e.g. “the” occurs with most of the nouns. We solve this by:
– Word frequency clipping: clip such overly frequent counts to min(X, t), with t ~ 100
– Ignoring these words altogether
2. Word proximity score: We give more weight, and hence more prominence, to words that appear close together than to words that appear, say, 5 words away.
3. Pearson correlation: Compute the Pearson correlation between two words; if two words are more correlated, they are more similar to each other. Negative correlation values are clipped to 0.
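The three tweaks above can be sketched in a few lines of Python. The matrix values, the helper name `ramped_counts` and the toy sentence are all assumptions made for illustration.

```python
import numpy as np

# Illustrative co-occurrence matrix (counts are assumptions).
X = np.array([
    [0, 250, 3],
    [250, 0, 7],
    [3, 7, 0],
], dtype=float)

# 1. Frequency clipping: cap every count at t (t ~ 100 in the text).
t = 100
X_clipped = np.minimum(X, t)

# 2. Proximity weighting happens at counting time: a neighbour at
#    distance d contributes weight 1/d instead of a flat 1.
def ramped_counts(sentence, index, size, window=5):
    M = np.zeros((size, size))
    for i, w in enumerate(sentence):
        for d in range(1, window + 1):
            for j in (i - d, i + d):
                if 0 <= j < len(sentence):
                    M[index[w], index[sentence[j]]] += 1.0 / d
    return M

# 3. Pearson correlation between word rows, negatives clipped to 0.
corr = np.corrcoef(X)
corr = np.maximum(corr, 0.0)
```

Any of these transformed matrices can then be fed to SVD in place of the raw counts.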

Question: Now, with all this, can you show with an example the syntactic and semantic patterns captured by the document vectors above?
Answer:
Syntactic pattern: In the picture below, we can observe that different grammatical forms of a word, e.g. {show, shown, showing} and {take, took, taken}, cluster together, capturing syntactic patterns.

Fig. Syntactic Pattern Captured                          Fig. Semantic Pattern Captured

Semantic pattern: Similarly, we can see that related verb–noun pairs sit at similar Euclidean distances, capturing semantic relationships. E.g. in the picture above, drive–driver and clean–cleaner are separated by roughly the same distance.

Question: Is it applicable across all languages?
Answer: Generally it works for many languages; however, in some cases this frequency-based co-occurrence approach will fail. For example, Finnish and German have richer morphology and compound nouns, which means many more rare words. The rarer the words, the lower their frequencies, and the harder this approach is to use in its vanilla form.

Question: How do we then solve the problem of capturing vectors for morphologically richer languages such as German and Finnish?
Answer: We can use character-based natural language processing.

Question: Well, Singular Value Decomposition (SVD) reduces the dimensions nicely and is simple to implement. But does SVD scale well, i.e. is SVD fast enough for millions of words or documents?
Answer: Unfortunately, SVD does not scale well to millions of words or documents:
1. It is computationally expensive: for an m × n matrix with n ≤ m, the cost is O(mn²).
2. Adding new words or documents is hard: every time we have new documents, we have to run the SVD again, which is not good.
3. It requires a huge amount of RAM.

Question: We have count-based methods, i.e. SVD over the co-occurrence matrix, and window-based direct prediction methods, as in skip-gram models. What is the difference between them?
Answer: Roughly, count-based methods train quickly and make efficient use of the global co-occurrence statistics, but they primarily capture word similarity. Direct prediction methods scale with corpus size and can capture more complex patterns beyond similarity, but they make less efficient use of the global statistics.

Question: So how can we develop a scalable model?
Answer: We can develop a scalable model using the second method, i.e. by directly learning word vectors. The most popular direct learning methods, which learn vectors with backpropagation, are the Word2Vec and GloVe models.

To learn more about Word2Vec, see Learning Word Vectors directly with Word2Vec.
To learn more about the Global Vector (GloVe) model, see Learning word vectors with Glove (coming soon).

### MISC Questions

Question: In which cases do window-based word vectors fail?
Answer: Window-based word vectors fail at:
Problem 1. Capturing multiple meanings of the same word.
Problem 2. Handling very frequent words such as a, an, the, is, was, etc.

Question: How do we solve Problem 2, i.e. the problem of too-frequent words?
Answer: Overly frequent words could be handled with tf-idf at the whole-document level. However, in a windowed approach we cannot use tf-idf, since we are not looking over whole documents but only at a fixed window size. Hence, the solutions are:
1. Ignore the most common words altogether.
2. A slightly better way is to clip the word frequency at some constant maximum, i.e. min(X, t). As we observed in Figure 2, the word “like”, with its higher frequency, occupies a special position in the word vector space, but such overly frequent words rarely add any value.
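For the whole-document case, a minimal tf-idf sketch in Python follows. The toy documents and the particular weighting variant (raw term frequency times log inverse document frequency) are assumptions, since the text does not fix a formula.

```python
import math
from collections import Counter

# Toy corpus of tokenised documents (contents are assumptions).
docs = [
    ["the", "boat", "sailed", "the", "sea"],
    ["the", "captain", "steered", "the", "boat"],
    ["the", "crew", "cleaned", "the", "deck"],
]
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = Counter()
for doc in docs:
    df.update(set(doc))

def tf_idf(word, doc):
    """One common tf-idf variant: term frequency * log inverse doc frequency."""
    tf = doc.count(word) / len(doc)
    idf = math.log(n_docs / df[word])
    return tf * idf

# "the" appears in every document, so its idf (and hence tf-idf) is 0,
# which is exactly how tf-idf suppresses overly frequent words.
print(tf_idf("the", docs[0]))   # 0.0
print(tf_idf("sea", docs[0]))   # > 0
```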

Question: Representing a word as a single vector, isn’t it problematic that it cannot grasp multiple meanings of a word?
Answer: Yes, it is problematic. A single vector cannot capture multiple meanings of the same word, e.g. bank could be a money bank or the bank of a river; such differences are not captured by one word vector. One way of capturing them is to use multiple vectors per word, such as bank_1, bank_2, etc. In the general case, however, a single word vector is good enough.

Question: Do too-frequent common words such as “the”, “he”, etc. affect word vectors?
Answer: Yes, they do, and there are multiple ways to deal with them:
a. Clip the word frequency, i.e. use min(X, t), with t ~ 100.
b. Use ramped windows, i.e. give closer words more weight.
c. Use Pearson correlation instead of raw counts, with negative values clipped to 0.
d. Or remove such words altogether. This depends on your problem scope: if you are predicting next words, then words such as “the” and “of” are relevant; in other cases, such as resolving syntactic ambiguity, they may not be so important.
