Question : What is the difference between count-based and prediction-based models, which are used to capture semantic similarity between words?
Answer: Count Based : In count-based models, the semantic similarity between words is learned by counting how often they co-occur. For example, given the sentences
Kitty likes milk.
Cat likes milk.
the co-occurrence matrix computed will be

        Kitty  likes  milk  Cat
Kitty     0      1     1     0
likes     1      0     2     1
milk      1      2     0     1
Cat       0      1     1     0

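Kitty and Cat never co-occur directly, but both co-occur with the same words ("likes" and "milk"). Based on the co-occurrence matrix, we can therefore deduce that Cat and Kitty are related.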
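
The counting step can be sketched in a few lines of Python. This is only a minimal illustration: it assumes the whole sentence is the context window, lowercases the words, and uses cosine similarity between rows as the relatedness measure. None of these choices are stated in the answer above.

```python
import numpy as np

sentences = ["Kitty likes milk", "Cat likes milk"]
tokens = [s.lower().split() for s in sentences]

# Fixed vocabulary order so the rows match the matrix in the text.
vocab = ["kitty", "likes", "milk", "cat"]
index = {word: i for i, word in enumerate(vocab)}

# Count every ordered pair of distinct positions within a sentence
# (assumption: the context window is the whole sentence).
matrix = np.zeros((len(vocab), len(vocab)), dtype=int)
for sentence in tokens:
    for i, a in enumerate(sentence):
        for j, b in enumerate(sentence):
            if i != j:
                matrix[index[a], index[b]] += 1

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(matrix)
print("kitty vs cat:  ", cosine(matrix[index["kitty"]], matrix[index["cat"]]))
print("kitty vs likes:", cosine(matrix[index["kitty"]], matrix[index["likes"]]))
```

The rows for "kitty" and "cat" come out identical, so their cosine similarity is approximately 1, while "kitty" and "likes" score lower.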

Predictive Models: In predictive models, the word vectors are learnt by improving predictive ability, i.e. by minimizing the prediction loss between the target word and its context words. For example, Kitty and Cat could initially have been assigned random vectors that are far apart. However, in order to minimize the loss with their shared context words (“likes” and “milk”), both words have to move close to each other in the vector space. This is how the vectors capture semantic meaning.
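
To make the idea concrete, here is a small sketch of one possible predictive setup: a skip-gram style model with a full softmax, trained by plain stochastic gradient descent on the two example sentences. The model choice, the 2-dimensional vectors, the learning rate and the names W_in and W_out are all assumptions made for illustration, not something the answer above specifies.

```python
import numpy as np

sentences = [["kitty", "likes", "milk"], ["cat", "likes", "milk"]]
vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

# (target, context) pairs: every other word in the same sentence is context.
pairs = [(index[t], index[c])
         for s in sentences
         for i, t in enumerate(s)
         for j, c in enumerate(s) if i != j]

rng = np.random.default_rng(0)
dim = 2                                                  # assumed vector size
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))    # word vectors
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))   # context vectors

def softmax(scores):
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def mean_loss():
    """Average cross-entropy of predicting the context word from the target."""
    return float(np.mean([-np.log(softmax(W_out @ W_in[t])[c]) for t, c in pairs]))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

loss_before = mean_loss()
lr = 0.1                                                 # assumed learning rate
for epoch in range(500):
    for t, c in pairs:
        grad_scores = softmax(W_out @ W_in[t])
        grad_scores[c] -= 1.0                # d(loss)/d(scores) = probs - one_hot
        grad_in = W_out.T @ grad_scores      # computed before W_out is updated
        W_out -= lr * np.outer(grad_scores, W_in[t])
        W_in[t] -= lr * grad_in
loss_after = mean_loss()

print("loss before:", loss_before, "loss after:", loss_after)
print("kitty vs cat:", cosine(W_in[index["kitty"]], W_in[index["cat"]]))
```

Since "kitty" and "cat" are trained against exactly the same context words, their vectors receive similar updates, which is the mechanism described above. On a corpus this small the numbers are only illustrative.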

Question : If both capture the same information, which one is the preferred method?
Answer : Count-based methods calculate the co-occurrence matrix for all words, hence they tend to consume a lot of memory compared to predictive models. Dimensionality reduction is then applied to the large matrix to map it to lower dimensions, which makes the model more robust because it keeps the most significant information and discards less significant information or noise. However, in exchange for the higher memory consumption, one advantage is that counting is easily parallelisable, since all we are doing is computing counts across the documents. This means we can train on more data (hundreds of GBs) and get a more accurate model, which matters because enormous amounts of text data are readily available.
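
The two steps above (independent counting per split, then dimensionality reduction) can be sketched as follows. This is a toy illustration under assumptions: the shards are processed in a simple loop rather than by real parallel workers, the helper name count_shard is hypothetical, and truncated SVD is used as the dimensionality reduction technique, which the answer does not name.

```python
from collections import Counter
from itertools import permutations

import numpy as np

def count_shard(sentences):
    """Count ordered co-occurring word pairs within each sentence of one shard."""
    counts = Counter()
    for sentence in sentences:
        counts.update(permutations(sentence, 2))
    return counts

# Each shard stands in for one file split; the calls do not depend on
# each other, so they could run on separate workers.
shards = [
    [["kitty", "likes", "milk"]],
    [["cat", "likes", "milk"]],
]
partial_counts = [count_shard(shard) for shard in shards]
total = sum(partial_counts, Counter())   # merging is just adding the counts

vocab = sorted({word for pair in total for word in pair})
index = {word: i for i, word in enumerate(vocab)}
matrix = np.zeros((len(vocab), len(vocab)))
for (a, b), n in total.items():
    matrix[index[a], index[b]] = n

# Keep only the k strongest directions of the count matrix.
k = 2
U, S, Vt = np.linalg.svd(matrix)
vectors = U[:, :k] * S[:k]               # one k-dimensional vector per word

for word in vocab:
    print(word, vectors[index[word]])
```

Because the per-shard counts only need to be added together at the end, the counting scales out easily; the memory cost shows up in the full vocabulary-by-vocabulary matrix that has to be built before the reduction.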

With prediction-based models, since we are trying to minimize the loss over each small batch of data drawn from the hundreds of GBs of data, parallelisation cannot be done simply by splitting the files and training on each split independently. More complex parallelisation techniques, such as training on GPUs, need to be implemented for them. However, one advantage is that they consume less memory, since they don’t have to compute large co-occurrence matrices.