I am trying to properly understand the batch_input and batch_labels from the TensorFlow "Vector Representations of Words" tutorial.
For instance, my data:

1 1 1 1 1 1 1 1 5 251 371 371 1685 ...

and my code starts with:
skip_window = 2 # How many words to consider left and right.
num_skips = 1 # How many times to reuse an input to generate a label.
Then the generated input array is:
batch_input = 1 1 1 1 1 1 5 251 371 ....
This makes sense: it starts after the first 2 words (= window size) and then continues. The labels:
batch_labels = 1 1 1 1 1 1 251 1 1685 371 589 ...
I don't understand these labels very well. There are supposed to be 4 labels for each input, right (window size 2, on each side)? But the batch_labels variable is the same length as batch_input.
From the TensorFlow tutorial:
The skip-gram model takes two inputs. One is a batch full of integers representing the source context words, the other is for the target words.
As per the tutorial, I have declared the two variables as:
batch = np.ndarray(shape=(batch_size), dtype=np.int32)      # input word ids
labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)  # one label id per input
How should I interpret the batch_labels?
The key setting is num_skips = 1. This value defines the number of (input, label) tuples each word generates. See the examples with different num_skips below (my data sequence seems to be different from yours, sorry about that).
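All three examples use the tutorial's generate_batch function. For reference, here is a close paraphrase of it from word2vec_basic.py (the original keeps data_index as a global variable; this sketch passes it in and returns it so the snippet is self-contained):

import collections
import random
import numpy as np

def generate_batch(data, data_index, batch_size, num_skips, skip_window):
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window | center | skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):  # fill the sliding window
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        targets_to_avoid = [skip_window]  # never use the center word as its own label
        for j in range(num_skips):
            target = skip_window
            while target in targets_to_avoid:  # sample an unused context position
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]  # input: center word
            labels[i * num_skips + j, 0] = buffer[target]   # label: one context word
        buffer.append(data[data_index])  # slide the window one word to the right
        data_index = (data_index + 1) % len(data)
    return batch, labels, data_index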
Example #1 - num_skips=4
It generates 4 labels for each word, i.e. it uses the whole context; since batch_size=8, only 2 words are processed in this batch (12 and 6), the rest will go into the next batch.
Example #2 - num_skips=2
Here you would expect each word to appear twice in the batch sequence; the 2 labels are randomly sampled from the 4 possible context words.
Example #3 - num_skips=1
Finally, this setting, the same as yours, produces exactly one label for each word; each label is drawn randomly from the 4-word context.
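To see all three behaviours concretely, you can run the sketch above on a toy sequence. The data list below is hypothetical, chosen only so that the word 12 has the neighbours that appear in the training pairs discussed next; the exact labels will differ between runs because of the random sampling:

data = [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156]  # hypothetical word ids

for num_skips in (4, 2, 1):
    batch, labels, _ = generate_batch(data, 0, batch_size=8,
                                      num_skips=num_skips, skip_window=2)
    print("num_skips =", num_skips)
    print("batch: ", batch)
    print("labels:", labels.ravel())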
Each label is one word drawn from the context of the input word: the input is the center word and the label is one of its surrounding words. But the generated data may not include all (center, context) tuples, depending on the settings of the generator.

Also note that the train_labels tensor holds a single label per input (shape (batch_size, 1)), not all 4 context words at once: Skip-Gram trains the model to predict any one context word from the given center word. This explains why all training pairs (12, 6), (12, 3084), (12, 5239) and (12, 195) are valid.
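For context, this one-label-per-row shape is exactly what the loss in the tutorial expects. Roughly, from word2vec_basic.py (TF1-style API; nce_weights, nce_biases, embed, num_sampled and vocabulary_size are all defined earlier in that script):

import tensorflow as tf

loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,         # output embedding matrix
                   biases=nce_biases,
                   labels=train_labels,         # shape (batch_size, 1): one context word per row
                   inputs=embed,                # embeddings of the center words
                   num_sampled=num_sampled,     # number of negative samples
                   num_classes=vocabulary_size))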