Here we will briefly discuss the implementation aspects of the CBOW model, with reference to the “CBOW implementation code”.
Step 1: Build the continuous bag of words.
Question : So how do we construct the training data for the CBOW model?
Answer : In the CBOW model, as the name suggests, we take the n near words as the input and the center word as the label. For example, let the document be
CBOW_Data1 = “Washington is capital of US. ”
N-near words = 2, i.e., we consider two near words on either side of the center word. Then
train_batch = [‘Washington’, ‘is’ , ‘of’ , ‘US’ ] and label = [ ‘capital’ ]
In the GitHub code (Github Continuous bag of words), the number of near words on each side is represented by the variable “skip_window”.
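The context/label construction described above can be sketched as follows. This is an illustrative sketch, not the repository’s code; the function name `cbow_pair` is made up here, and `skip_window` plays the same role as the variable of that name in the GitHub code.

```python
# Illustrative sketch (not the repository's code): build one CBOW
# (context, label) pair from the document's example sentence.
def cbow_pair(words, center_index, skip_window):
    """Return (context_words, label_word) for one center position."""
    left = words[max(0, center_index - skip_window):center_index]
    right = words[center_index + 1:center_index + 1 + skip_window]
    return left + right, words[center_index]

words = "Washington is capital of US".split()
context, label = cbow_pair(words, center_index=2, skip_window=2)
# context -> ['Washington', 'is', 'of', 'US'], label -> 'capital'
```

This reproduces the train_batch / label pairing shown above: the two words on each side of “capital” form the input, and “capital” itself is the label.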
Question: In the above example, I see that commonly occurring words such as “is” and “of” are used, with the assumption that they have significant relevance to predicting the word “capital”. Is that a valid assumption?
Answer: No, it is not a valid assumption. Such commonly occurring words are rarely significant: they add no information for predicting the label word “capital”, and often add noise instead. For example, the same fact could also be written as below, showing that “is” and “of” have no particular relevance:
CBOW_Data2 = “Washington has been the US’s capital, since 1970”.
Hence, a removal step is often applied to drop such commonly occurring words.
In our GitHub code, we use build_dataset to filter such commonly occurring words out.
With commonly occurring words removed, the sentences CBOW_Data1 and CBOW_Data2 then become
CBOW_Data1 = “Washington is capital of US. ” => “Washington capital US”
CBOW_Data2 = “Washington has been the US’s capital, since 1970”. => “Washington capital US 1970 ”
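A minimal sketch of this filtering step is shown below. The stop-word list here is an assumption for illustration only; it is not the vocabulary-frequency logic that build_dataset uses in the repository.

```python
# Illustrative stop-word removal sketch; STOP_WORDS is an assumed list
# for this example, not the filtering criterion used by build_dataset.
STOP_WORDS = {"is", "of", "has", "been", "the", "since"}

def remove_stop_words(text):
    # strip simple punctuation, then drop words found in the stop list
    kept = [w for w in text.replace(",", "").replace(".", "").split()
            if w.lower() not in STOP_WORDS]
    return " ".join(kept)

print(remove_stop_words("Washington is capital of US."))
# -> "Washington capital US"
```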
Question : Well, is removing the most common words always desired, then?
Answer : No, removing the most common words is not always desirable; it depends on the situation. If our goal is to capture semantics, then removing such common words is desired.
However, if our goal is also to use syntactic information, then it is not desired: for example, a noun often precedes a verb, as in “Bill is a hard worker”.
As can be seen above, if we remove the common word “is”, the sentence becomes “Bill hard worker”, which can no longer capture the syntactic information that a noun often precedes a verb, and hence that “Bill” is a noun.
Question : OK, I understand the general working of CBOW, but in the code, how are the input and output created from a text document?
Answer : First we need to prepare the input and output data. Let’s explain the input and the output with an example.
If Document = “anarchism originated as a term of abuse first used against early working class radicals”
then, for the input, we need to do the following:
Step 1.1 : Determine batch size
Let’s consider the batch size to be 4. Then our batch_data becomes
batch_data = “anarchism originated as a”
Step 1.2 : Determine number of surrounding words
Let’s consider the number of surrounding words to be 1, i.e., skip_window = 1. Then our input_batch and output_label become:
Input => Output
[ anarchism, as ] => [ originated]
[ originated, a ] => [ as ]
Hence, in our code, the number of output words is
cbow_no_of_output_word = batch_size - (2 * skip_window)
With batch_size = 4 and skip_window = 1, this gives 4 - 2 = 2 output words, matching the two input/output pairs above.
batch = np.ndarray(shape=(cbow_no_of_output_word, 2 * skip_window), dtype=np.int32)
labels = np.ndarray(shape=(cbow_no_of_output_word, 1), dtype=np.int32)
We use the function generate_batch(batch_size, skip_window, reverse_dictionary=None, verbose=0) to generate the training dataset and training labels. See the code for the function ‘generate_batch’.
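The steps above can be sketched end to end as follows. This is a simplified illustration, not the repository’s generate_batch (which also maintains dictionaries and a moving data index); here the data is already a list of word ids for “anarchism originated as a”, using an assumed toy dictionary.

```python
import numpy as np

# Simplified sketch of CBOW batch generation (not the repo's function).
def generate_batch(data, batch_size, skip_window):
    n_outputs = batch_size - 2 * skip_window  # cbow_no_of_output_word
    batch = np.ndarray(shape=(n_outputs, 2 * skip_window), dtype=np.int32)
    labels = np.ndarray(shape=(n_outputs, 1), dtype=np.int32)
    for j in range(n_outputs):
        center = j + skip_window
        # context = skip_window words before and after the center word
        batch[j, :] = data[j:center] + data[center + 1:center + 1 + skip_window]
        labels[j, 0] = data[center]
    return batch, labels

# assumed word ids: anarchism=0, originated=1, as=2, a=3
data = [0, 1, 2, 3]
batch, labels = generate_batch(data, batch_size=4, skip_window=1)
# batch  -> [[0, 2], [1, 3]]  i.e. [anarchism, as], [originated, a]
# labels -> [[1], [2]]        i.e. [originated], [as]
```

This matches the two pairs shown above: [anarchism, as] => [originated] and [originated, a] => [as].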
Step 2: Training CBOW Model
Once we have the training data and labels, we need to train our model, whose architecture is based on the code in the Github Continuous bag of words repository.
In continuous bag of words, since we have to come up with the label word [originated] from the two words [anarchism, as], we have to average the vectors of these two words.
In the code, this lookup is done through the “all_vocab_wv” variable, and the vector averaging is done with TensorFlow ops, for better efficiency.
wv_1_batch_1_words = tf.nn.embedding_lookup(all_vocab_wv, train_dataset[:, i])
wv_1_batch_n_words = None  # accumulator for the stacked context embeddings
for i in range(2 * skip_window):
    wv_1_batch_1_words = tf.nn.embedding_lookup(all_vocab_wv, train_dataset[:, i])
    print('embedding for train_dataset[:,i] "', train_dataset[:, i],
          '" is wv_1_batch_1_words :: "', wv_1_batch_1_words)
    emb_x, emb_y = wv_1_batch_1_words.get_shape().as_list()
    print('embedding_%d shape is : %s emb_x :"%s", emb_y: "%s" '
          % (i, wv_1_batch_1_words.get_shape().as_list(), emb_x, emb_y))
    if wv_1_batch_n_words is None:
        print('embed is None, hence reshaping from %s to (%s,%s)'
              % (wv_1_batch_1_words.get_shape(), emb_x, emb_y))
        wv_1_batch_n_words = tf.reshape(wv_1_batch_1_words, [emb_x, emb_y, 1])
        wv_1_batch_n_words = tf.Print(wv_1_batch_n_words, [wv_1_batch_n_words])
    else:
        print('embed is not None, hence concatenating earlier wv_1_batch_n_words %s with current wv_1_batch_1_words'
              % (wv_1_batch_n_words.get_shape()))
        wv_1_batch_n_words = tf.concat(2, [wv_1_batch_n_words,
                                           tf.reshape(wv_1_batch_1_words, [emb_x, emb_y, 1])])
avg_wv_1_batch_n_words = tf.reduce_mean(wv_1_batch_n_words,2,keep_dims=False)
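The lookup-and-average step above can be mirrored in plain NumPy, which may make the shapes clearer. This is an illustrative sketch with made-up 2-dimensional embeddings; the variable names `all_vocab_wv` and `train_dataset` are reused from the text, but the values are assumptions.

```python
import numpy as np

# NumPy sketch of embedding lookup + context averaging (toy values).
all_vocab_wv = np.array([[1.0, 0.0],   # id 0: "anarchism"
                         [0.0, 2.0],   # id 1: "originated"
                         [3.0, 0.0],   # id 2: "as"
                         [0.0, 4.0]])  # id 3: "a"
train_dataset = np.array([[0, 2],      # context of "originated"
                          [1, 3]])     # context of "as"

# fancy indexing gives shape (rows, 2*skip_window, embedding_dim);
# averaging over axis 1 mirrors tf.reduce_mean over the context words
avg_wv = all_vocab_wv[train_dataset].mean(axis=1)
# avg_wv -> [[2.0, 0.0], [0.0, 3.0]]
```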
For computing word similarity, we use cosine similarity in the code, since cosine similarity is not sensitive to vector lengths.
norm = tf.sqrt(tf.reduce_sum(tf.square(all_vocab_wv), 1, keep_dims=True))
normalized_embeddings = all_vocab_wv / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
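The same normalize-then-dot-product computation can be sketched in NumPy to show why cosine similarity ignores vector length. The embedding values below are made up for illustration.

```python
import numpy as np

# NumPy sketch of cosine similarity via L2 normalization (toy values):
# after each row is scaled to unit length, a dot product depends only
# on direction, not on the original magnitudes (3.0 vs 5.0 below).
all_vocab_wv = np.array([[3.0, 0.0],
                         [0.0, 5.0],
                         [1.0, 1.0]])
norm = np.sqrt((all_vocab_wv ** 2).sum(axis=1, keepdims=True))
normalized = all_vocab_wv / norm
similarity = normalized @ normalized.T
# similarity[0, 0] -> 1.0 (a vector is maximally similar to itself)
# similarity[0, 1] -> 0.0 (orthogonal vectors)
```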
Question : Wow, amazing! Word2vec can capture word similarity. Now I want to do more complex tasks such as language translation. Can I do so with word2vec models?
Answer : No, we cannot. Word2vec models take monolingual input, i.e., English-only text or French-only text. We need a model that can take both English and French as inputs and then learn the translation.
See “RNN” for more details.