Understanding input and labels in word2vec (TensorFlow)

Posted 2019-05-15 06:48

Question:

I am trying to properly understand the batch_input and batch_labels from the tensorflow "Vector Representations of Words" tutorial.

For instance, my data

 1 1 1 1 1 1 1 1 5 251 371 371 1685 ...

... and I start with

skip_window = 2 # How many words to consider left and right.
num_skips = 1 # How many times to reuse an input to generate a label.

Then the generated input array is:

batch_input = 1 1 1 1 1 1 5 251 371 ....  

This makes sense: it starts after the first 2 words (= the window size) and then continues. The labels:

batch_labels = 1 1 1 1 1 1 251 1 1685 371 589 ...

I don't understand these labels very well. There are supposed to be 4 labels for each input, right (window size 2, on each side)? But the batch_labels variable is the same length as batch_input.

From the tensorflow tutorial:

The skip-gram model takes two inputs. One is a batch full of integers representing the source context words, the other is for the target words.

As per the tutorial, I have declared the two variables as:

  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)

How should I interpret the batch_labels?

Answer 1:

There are supposed to be 4 labels for each input, right (window size 2, on each side)? But the batch_labels variable is the same length as batch_input.

The key setting is num_skips = 1. This value defines the number of (input, label) tuples each word generates. See the examples with different num_skips below (my data sequence seems to be different from yours, sorry about that).

Example #1 - num_skips=4

batch, labels = generate_batch(batch_size=8, num_skips=4, skip_window=2)

It generates 4 labels for each word, i.e. it uses the whole context; since batch_size=8, only 2 words are processed in this batch (12 and 6), and the rest will go into the next batch:

data = [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156, 128, 742, 477, 10572, ...]
batch = [12 12 12 12  6  6  6  6]
labels = [[6 3084 5239 195 195 3084 12 2]]

Example #2 - num_skips=2

batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=2)

Here you would expect each word to appear twice in the batch sequence; the 2 labels per word are randomly sampled from the 4 possible context words:

data = [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156, 128, 742, 477, 10572, ...]
batch = [ 12  12   6   6 195 195   2   2]
labels = [[ 195 3084   12  195 3137   12   46  195]]

Example #3 - num_skips=1

batch, labels = generate_batch(batch_size=8, num_skips=1, skip_window=2)

Finally, this setting, the same as yours, produces exactly one label per word; each label is drawn randomly from the 4-word context:

data = [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156, 128, 742, 477, 10572, ...]
batch = [  12    6  195    2 3137   46   59  156]
labels = [[  6  12  12 195  59 156  46  46]]
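For reference, below is a simplified sketch of a generate_batch function in the spirit of the tutorial's generator (the function name, the data argument, and the exact sampling details are my assumptions, not the tutorial's verbatim code). It shows where skip_window and num_skips enter: each center word yields num_skips (input, label) pairs, with the labels sampled from the 2*skip_window surrounding words.

import collections
import random
import numpy as np

data_index = 0  # position in the data stream, kept across calls

def generate_batch(data, batch_size, num_skips, skip_window):
    # Simplified sketch: every center word produces num_skips pairs, each pairing
    # the center word with one randomly chosen word from its context window.
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size,), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window context | center | skip_window context ]
    window = collections.deque(maxlen=span)
    for _ in range(span):
        window.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        context_positions = [p for p in range(span) if p != skip_window]
        for j, p in enumerate(random.sample(context_positions, num_skips)):
            batch[i * num_skips + j] = window[skip_window]    # input: the center word
            labels[i * num_skips + j, 0] = window[p]          # label: one context word
        window.append(data[data_index])                       # slide the window by one word
        data_index = (data_index + 1) % len(data)
    return batch, labels

# e.g. reproducing the shape of Example #3 (the randomly sampled labels will differ):
data = [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156, 128, 742, 477, 10572]
batch, labels = generate_batch(data, batch_size=8, num_skips=1, skip_window=2)

The tutorial's actual function keeps data as a global and handles the end of the stream slightly differently; the sketch only illustrates the pairing logic.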

How should I interpret the batch_labels?

Each label is a context word to be predicted from the center word that appears in the batch. But the generated data may not include all (center, context) tuples, depending on the settings of the generator.
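One way to make this concrete is to print the pairs as words. Assuming you built reverse_dictionary as in the tutorial (a dict mapping word ids back to word strings), and using the generate_batch sketch above, something like this shows each (input, label) pair:

batch, labels = generate_batch(data, batch_size=8, num_skips=1, skip_window=2)
for center_id, context_id in zip(batch, labels[:, 0]):
    # one training pair per row: predict the context word from the center word
    print(reverse_dictionary[int(center_id)], '->', reverse_dictionary[int(context_id)])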

Also note that the train_labels tensor holds a single word per training example (it has shape (batch_size, 1)). Skip-Gram trains the model to predict one context word at a time from the given center word, not all 4 context words at once. This explains why all training pairs (12, 6), (12, 3084), (12, 5239) and (12, 195) are valid.
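For completeness, this is roughly how the (TF 1.x era) tutorial wires these tensors into the NCE loss. It is a sketch with placeholder hyperparameter values, not the tutorial's exact code, but it shows that the labels tensor carries exactly one target word per training pair:

import math
import tensorflow as tf  # TF 1.x-style API, as used by the tutorial

vocabulary_size = 50000   # placeholder hyperparameters
embedding_size = 128
batch_size = 128
num_sampled = 64          # number of negative samples for NCE

train_inputs = tf.placeholder(tf.int32, shape=[batch_size])     # center words
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])  # one context word each

embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                   labels=train_labels, inputs=embed,
                   num_sampled=num_sampled, num_classes=vocabulary_size))

tf.nn.nce_loss expects labels of shape [batch_size, num_true] with num_true defaulting to 1, which is why sampling a single context word per center word is enough.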