Here we briefly discuss the implementation aspects of the CBOW model, with reference to the "CBOW implementation code".

  • Step 1: Build the continuous bag of words training data.

Question: How do we construct the training data for the CBOW model?
Answer: In the CBOW model, as the name suggests, we take the n near words as the input and the center word as the label. For example, let the document be
CBOW_Data1 = "Washington is capital of US."
N-near words = 2, i.e. we consider two near words on either side of the center word. Then
train_batch = ['Washington', 'is', 'of', 'US']   and   label = ['capital']
In the GitHub code Continuous bag of words, the number of near words is represented by the variable "skip_window".
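As a quick, stand-alone illustration of this pairing (not the repository code; the sentence and window below are just the example above):

sentence = "Washington is capital of US"
words = sentence.split()
skip_window = 2                  # two near words on either side of the center word

center = words.index('capital')  # position of the center word
train_batch = words[center - skip_window:center] + words[center + 1:center + 1 + skip_window]
label = [words[center]]

print(train_batch, label)        # ['Washington', 'is', 'of', 'US'] ['capital']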

Question: In the above example, I see that very commonly occurring words such as "is" and "of" are used as input, with the assumption that they have significant relevance to predicting the word "capital". Is that a valid assumption?
Answer: No, it is not a valid assumption. Such commonly occurring words are rarely significant. They add no information for predicting the label word "capital" and often just add noise. For example, the same idea could also be written as below, showing that "is" and "of" have no real relevance:
CBOW_Data2 = "Washington has been the US's capital, since 1970."
Hence, a removal technique is often applied to filter out such commonly occurring words.
In our GitHub code, we use build_dataset to filter such commonly occurring words out.

def build_dataset(words):   # see the repository for the full implementation

With the commonly occurring words removed, the sentences CBOW_Data1 and CBOW_Data2 then become
CBOW_Data1 = "Washington is capital of US."          =>  "Washington capital US"
CBOW_Data2 = "Washington has been the US's capital, since 1970."  =>  "Washington capital US 1970"
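A minimal sketch of this kind of filtering is shown below; the stop-word list here is purely illustrative, and the repository's build_dataset may implement the filtering quite differently:

common_words = {'is', 'of', 'has', 'been', 'the', 'since'}   # illustrative stop-word list

def remove_common_words(sentence):
    # strip simple punctuation and possessives, then drop the common words
    cleaned = sentence.replace(',', '').replace('.', '').replace("'s", '')
    return ' '.join(w for w in cleaned.split() if w.lower() not in common_words)

print(remove_common_words("Washington is capital of US."))
# -> Washington capital US
print(remove_common_words("Washington has been the US's capital, since 1970."))
# -> Washington US capital 1970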
Question: Is removing the most common words always desirable then?
Answer: No, removing the most common words is not always desirable; it depends on the situation. If the goal is to capture semantics, then removing such common words is desirable.
However, if our goal is also to use syntactic information, then it is not desirable, e.g. a noun often precedes a verb, as in "Bill is a hard-worker".
As can be seen above, if we remove the common word "is", the sentence becomes "Bill hard-worker", which no longer captures the syntactic pattern "a noun often precedes a verb", and hence that "Bill" is a noun.

Question: OK, I understand the general working of CBOW, but in the code how are the input and output created from a text document?
Answer: First we need to prepare the input and output data. Let's explain the input and the output with an example.
If Document = "anarchism originated as a term of abuse first used against early working class radicals"
then for the input, we need to do the following:

  • Step 1.1: Determine the batch size

Let's consider the batch size to be 4. Then our batch_data becomes

batch_data = “anarchism originated as a”

  • Step 1.2: Determine the number of surrounding words

Let's consider the number of surrounding words to be 1, i.e. skip_window = 1. Then our input_batch and output_label become:

Input                 =>   Output

[ anarchism, as ]  =>  [ originated]

[  originated, a  ]  =>  [ as ]

Hence, in our code, cbow_no_of_output_word is:

cbow_no_of_output_word = batch_size // (2 * skip_window)

batch = np.ndarray(shape=( cbow_no_of_output_word , 2 * skip_window ), dtype=np.int32)
labels = np.ndarray(shape=( cbow_no_of_output_word ,1), dtype=np.int32)
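For the running example above, batch_size = 4 and skip_window = 1, so cbow_no_of_output_word is 2: batch has shape (2, 2), holding the two context rows [ anarchism, as ] and [ originated, a ], and labels has shape (2, 1), holding the center words [ originated ] and [ as ].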

 

We use the function generate_batch(batch_size, skip_window, reverse_dictionary=None, verbose=0) to generate the training dataset and training labels. See the code for the function 'generate_batch'.
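The exact batching logic lives in generate_batch in the repository; the sketch below only illustrates the idea on a plain Python list of word ids and ignores details such as the global data index, reverse_dictionary, and verbose handling:

import numpy as np

def generate_batch_sketch(data, batch_size, skip_window):
    # Illustrative sketch only -- the repository's generate_batch differs in detail.
    span = 2 * skip_window + 1                       # context words + the center word
    n_rows = batch_size // (2 * skip_window)         # cbow_no_of_output_word
    batch = np.ndarray(shape=(n_rows, 2 * skip_window), dtype=np.int32)
    labels = np.ndarray(shape=(n_rows, 1), dtype=np.int32)
    for row in range(n_rows):
        window = data[row:row + span]                # slide the window one word at a time
        batch[row, :] = window[:skip_window] + window[skip_window + 1:]
        labels[row, 0] = window[skip_window]         # the center word is the label
    return batch, labels

# Toy usage: integer ids standing in for "anarchism originated as a term of ..."
word_ids = [0, 1, 2, 3, 4, 5]
batch_ids, label_ids = generate_batch_sketch(word_ids, batch_size=4, skip_window=1)
print(batch_ids)   # rows [0 2] and [1 3]  ->  [anarchism, as], [originated, a]
print(label_ids)   # rows [1] and [2]      ->  [originated], [as]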

  • Step 2: Training the CBOW model

Once we have the training data and labels, we need to train our model. The architecture, based on the code in the GitHub Continuous bag of words repository, is as follows: each context word is looked up in the word-vector matrix, and the resulting vectors are averaged to predict the center word.

In the code, this lookup is done on the "all_vocab_wv" variable, and the vector averaging is done with TensorFlow ops for better efficiency.

wv_1_batch_1_words = tf.nn.embedding_lookup( all_vocab_wv , train_dataset[:,i])

In continuous bag of words, since we have to come up with the label word [ originated ] from the two words [ anarchism, as ], we have to average the vectors of these two words.

wv_1_batch_n_words = None   # will hold the stacked context-word vectors
for i in range(2*skip_window):
    # look up the word vectors of the i-th context word for every row in the batch
    wv_1_batch_1_words = tf.nn.embedding_lookup(all_vocab_wv, train_dataset[:,i])
    print('embedding for train_dataset[:,i] "',train_dataset[:,i],'" is wv_1_batch_1_words :: "',wv_1_batch_1_words)
    emb_x,emb_y = wv_1_batch_1_words.get_shape().as_list()
    print('embedding_%d shape is : %s emb_x :"%s", emb_y: "%s" '\
          %(i,wv_1_batch_1_words.get_shape().as_list(), emb_x, emb_y))
    if wv_1_batch_n_words is None:
        print('embed is None, hence reshaping from %s to (%s,%s)'\
              %(wv_1_batch_1_words.get_shape(), emb_x, emb_y))
        wv_1_batch_n_words = tf.reshape(wv_1_batch_1_words,[emb_x,emb_y,1])
        wv_1_batch_n_words = tf.Print(wv_1_batch_n_words,[wv_1_batch_n_words])
    else:
        print('embed is not None, hence concating earlier wv_1_batch_n_words %s with current wv_1_batch_1_words'\
              %(wv_1_batch_n_words.get_shape()))
        # stack the new context-word vectors along a third axis (old tf.concat signature: axis first)
        wv_1_batch_n_words = tf.concat(2,[wv_1_batch_n_words,tf.reshape(wv_1_batch_1_words,[emb_x,emb_y,1])])

avg_wv_1_batch_n_words =  tf.reduce_mean(wv_1_batch_n_words,2,keep_dims=False)
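As a side note, assuming train_dataset holds the context-word ids with shape (cbow_no_of_output_word, 2 * skip_window) as above, the same lookup-and-average can be written without the explicit Python loop; this is an equivalent sketch, not the repository code:

# embedding_lookup on the full 2-D id tensor returns shape
# (cbow_no_of_output_word, 2 * skip_window, embedding_size)
context_wv = tf.nn.embedding_lookup(all_vocab_wv, train_dataset)
# average over the context-word axis -> (cbow_no_of_output_word, embedding_size)
avg_wv_1_batch_n_words = tf.reduce_mean(context_wv, 1)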

 

For computing word similarity, the code uses cosine similarity, since cosine similarity is not sensitive to vector lengths.

norm = tf.sqrt(tf.reduce_sum(tf.square(all_vocab_wv), 1, keep_dims=True))
normalized_embeddings = all_vocab_wv / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
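To actually read off the nearest neighbors, the similarity tensor can be evaluated inside a session and sorted; the snippet below is only a usage sketch, assuming reverse_dictionary maps word ids back to words and valid_examples holds the ids behind valid_dataset:

sim = similarity.eval()                            # shape: (len(valid_examples), vocabulary_size)
for i, word_id in enumerate(valid_examples):
    top_k = 8                                      # how many neighbors to show
    nearest = (-sim[i, :]).argsort()[1:top_k + 1]  # index 0 is the word itself, so skip it
    neighbors = [reverse_dictionary[idx] for idx in nearest]
    print('Nearest to %s: %s' % (reverse_dictionary[word_id], ', '.join(neighbors)))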

 

Question: Wow, amazing! Word2Vec can capture word similarity. Now I want to do more complex tasks such as language translation. Can I do so with word-to-vector models?
Answer: No, we cannot. Word-to-vector models take a monolingual input, i.e. English-language text only or French-language text only. We need a model that can take both English and French as inputs and then learn the translation.

See "RNN" for more details.

 

 

Ref:

https://www.quora.com/How-is-word2vec-different-from-the-RNN-encoder-decoder

 

 
