Problem: What is the major drawback of traditional Natural Language Processing systems?
Answer: Traditional NLP systems treat words as atomic ids, e.g. Dog – Id 10, Cat – Id 200. Such an id representation cannot provide the system with useful information that may exist, for example that both are pets, four-legged, etc.

Problem: Just out of curiosity, what were the traditional NLP models like?
Answer: The traditional NLP models started with the “N-gram approach”, where a document corpus was broken into 1-grams, 2-grams or 3-grams. Conditional probabilities estimated from these counts were then used to predict the next word in a sequence, e.g.
For a sequence of words ending in w(t-2), w(t-1), w(t):
Calculate P(w(t+1) | w1, …, w(t-2), w(t-1), w(t)) and predict the most probable next word, approximating this probability with 1-gram, 2-gram or 3-gram counts.
This approach is commonly referred to as “N-Gram Language models”
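The counting-and-prediction idea above can be sketched as a toy bigram (2-gram) model. The corpus, variable names and window here are illustrative, not from the notes; a real model would also smooth the counts.

```python
# Toy bigram language model: estimate P(next word | current word)
# by maximum-likelihood counts over a tiny corpus.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat drank milk".split()

# counts[w][w_next] = how often w_next follows w
counts = defaultdict(Counter)
for w, w_next in zip(corpus, corpus[1:]):
    counts[w][w_next] += 1

def predict_next(word):
    """Return the most probable next word after `word` (MLE over counts)."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" twice, "mat" only once -> "cat"
```

A trigram model works the same way, conditioning on the previous two words instead of one.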

Problem: How can we remove this major drawback?
Answer: By representing words as vectors we can overcome some of these limitations. With vectors we can compute similarity, and we can perform vector addition and subtraction, as in
dog – large = puppy
cat – large = kitten
And vectors are dense, unlike sparse random ids.
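A minimal sketch of this vector arithmetic, using made-up 3-dimensional vectors (real embeddings are learned, much higher-dimensional, and only approximately satisfy such analogies):

```python
# Toy demonstration of "dog - large = puppy" style vector arithmetic.
# The embeddings below are hand-crafted for illustration only:
# dimensions roughly mean (animal, large, feline).
import numpy as np

emb = {
    "dog":    np.array([1.0, 1.0, 0.0]),
    "puppy":  np.array([1.0, 0.0, 0.0]),
    "cat":    np.array([1.0, 1.0, 1.0]),
    "kitten": np.array([1.0, 0.0, 1.0]),
    "large":  np.array([0.0, 1.0, 0.0]),
}

def nearest(vec, exclude=()):
    """Word whose embedding is closest (Euclidean) to vec."""
    return min((w for w in emb if w not in exclude),
               key=lambda w: np.linalg.norm(emb[w] - vec))

result = emb["dog"] - emb["large"]
print(nearest(result, exclude={"dog", "large"}))  # -> puppy
```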

Problem: How do vectors solve the sparsity problem?
Answer: In a vector representation, words are embedded in a continuous vector space, with semantically similar words clustered together. The words follow the “Distributional Hypothesis”, which states that words which appear in the same contexts share similar meaning. See the Kitty and Cat example below.

Problem: How can we learn semantic similarity?
Answer: For example, kitty ≈ cat, but how do we learn this? We can use unsupervised learning, where we provide lots and lots of text to the deep learning model, which will then find a way to learn those similarities. For example:

Kitty likes milk.
Cat likes milk.

From these texts our model can learn the semantic similarity, i.e. that kitty and cat belong to the same class.

Problem: How do we find this closeness?
Answer: Use word2vec. For each word, map it to an embedding, then use the embedding to predict the context of the word, i.e. the words that are nearby in the window. For example, the model will say “Fox” and “Brown” are close. We then use logistic regression as the supervised model, i.e.
‘Fox’ -> ‘Brown’
‘Cat’ -> ‘Brown’

The model will then learn that fox- and cat-like animals have some link to ‘brown’. Although it does not know what a colour is, it is able to capture the association.
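The (target -> context) pairs that skip-gram word2vec trains on can be sketched as follows; the sentence and window size are illustrative:

```python
# Build skip-gram training pairs: each word predicts the words
# inside its context window (here, one word to each side).
sentence = "the quick brown fox jumps".split()
window = 1

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs)
# Includes ('fox', 'brown'): the model is trained to predict
# "brown" from "fox", which is how the closeness is learned.
```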

Problem: How is the closeness represented/found?
Answer: The embeddings are stored as points in N-dimensional space. We can then use nearest-neighbour search to find the closest words. Another way is to project the N-d space down to 2-D space for visualisation. However, if we use PCA we lose the closeness (neighbourhood) structure, hence we have to use t-SNE.

Figure 1: Example of 2-dimensional distributed representation for words obtained in (Blitzer et al. 2005).

Problem: So we have represented the data in vector space. Which metric should we use to compare two words for similarity: L2 distance or cosine?
Answer: The length of an embedding vector is not meaningful for similarity (it is affected by things like how often a word appeared during training), hence we should use the cosine distance, which depends only on the direction of the vectors.
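A quick sketch of why direction matters but length does not: a scaled copy of a vector is identical under cosine similarity yet far away under L2. The vectors are toy values.

```python
# Cosine similarity ignores vector length; L2 distance does not.
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v = np.array([1.0, 2.0, 3.0])
w = 10 * v  # same direction, ten times the length

print(cosine_sim(v, w))        # ~1.0: cosine treats them as the same word
print(np.linalg.norm(v - w))   # large L2 distance despite identical direction
```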

Problem: There are many, many words, and computing a softmax over all of them can be computationally exhausting. So what do we do?
Answer: Use sampled softmax, i.e. instead of taking all the words and computing the softmax over all of them, we take the correct word plus a random sample of n other words and perform the softmax over this small set. This works very well and is computationally efficient too.
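The idea can be sketched as below. The vocabulary, scores and sample size are made up; real implementations (e.g. in TensorFlow) also correct the sampled logits for the sampling distribution, which this sketch omits.

```python
# Sketch of sampled softmax: normalise over the true word plus a few
# randomly sampled words instead of the whole vocabulary.
import math
import random

random.seed(0)

vocab = [f"w{i}" for i in range(10_000)]
logits = {w: random.uniform(-1, 1) for w in vocab}  # stand-in model scores

def sampled_softmax_prob(true_word, n_samples=5):
    """Softmax probability of true_word over a small sampled candidate set."""
    negatives = random.sample(vocab, n_samples)
    candidates = [true_word] + [w for w in negatives if w != true_word]
    z = sum(math.exp(logits[w]) for w in candidates)
    return math.exp(logits[true_word]) / z

p = sampled_softmax_prob("w42")
print(p)  # a probability computed over ~6 words instead of 10,000
```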

REF :
1. Udacity Machine Learning
2. https://www.tensorflow.org/tutorials/word2vec
3. http://www.scholarpedia.org/article/Neural_net_language_models