How to build an embedding layer in a TensorFlow RNN?

I'm building an RNN LSTM network to classify texts based on the writers' age (binary classification - young / adult).

It seems like the network does not learn and then suddenly starts overfitting:

[Plot: red = train, blue = validation]

One possibility could be that the data representation is not good enough. I just sorted the unique words by their frequency and gave them indices. E.g.:

unknown -> 0
the     -> 1
a       -> 2
.       -> 3
to      -> 4

So I'm trying to replace that with word embeddings. I saw a couple of examples, but I'm not able to implement them in my code. Most of the examples look like this:

embedding = tf.Variable(tf.random_uniform([vocab_size, hidden_size], -1, 1))
inputs = tf.nn.embedding_lookup(embedding, input_data)

Does this mean we're building a layer that learns the embedding? I thought that one should download some pretrained Word2Vec or GloVe vectors and just use those.

Anyway let's say I want to build this embedding layer...
If I use these 2 lines in my code I get an error:

TypeError: Value passed to parameter 'indices' has DataType float32 not in list of allowed values: int32, int64

So I guess I have to change the input_data type to int32. So I do that (it's all indices after all), and I get this:

TypeError: inputs must be a sequence

I tried wrapping inputs (argument to tf.contrib.rnn.static_rnn) with a list: [inputs] as suggested in this answer, but that produced another error:

ValueError: Input size (dimension 0 of inputs) must be accessible via shape inference, but saw value None.

Update:

I was unstacking the tensor x before passing it to embedding_lookup. I moved the unstacking after the embedding.

Updated code:

MIN_TOKENS = 10
MAX_TOKENS = 30
x = tf.placeholder("int32", [None, MAX_TOKENS, 1])
y = tf.placeholder("float", [None, N_CLASSES]) # 0.0 / 1.0
...
seqlen = tf.placeholder(tf.int32, [None]) #list of each sequence length*
embedding = tf.Variable(tf.random_uniform([VOCAB_SIZE, HIDDEN_SIZE], -1, 1))
inputs = tf.nn.embedding_lookup(embedding, x) #x is the text after converting to indices
inputs = tf.unstack(inputs, MAX_POST_LENGTH, 1)
outputs, states = tf.contrib.rnn.static_rnn(lstm_cell, inputs, dtype=tf.float32, sequence_length=seqlen) #---> Produces error

*seqlen: I zero-padded the sequences so that all of them have the same length, but since the actual lengths differ, I prepared a list giving each sequence's length without the padding.

New error:

ValueError: Input 0 of layer basic_lstm_cell_1 is incompatible with the layer: expected ndim=2, found ndim=3. Full shape received: [None, 1, 64]

64 is the size of each hidden layer.

It's obvious that I have a problem with the dimensions... How can I make the inputs fit the network after embedding?

Answer 1:

According to the tf.nn.static_rnn documentation, the inputs argument must be:

A length T list of inputs, each a Tensor of shape [batch_size, input_size]

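Each element of the list must therefore be rank 2. The trailing dimension of size 1 in your placeholder survives the embedding lookup and makes each element rank 3, so drop it. Your code should be something like: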

x = tf.placeholder("int32", [None, MAX_TOKENS])
...
inputs = tf.unstack(inputs, axis=1)
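
To see why the placeholder shape matters, here is a minimal sketch that mimics the shapes with NumPy rather than TensorFlow. It assumes that integer-array indexing stands in for tf.nn.embedding_lookup and that slicing along axis 1 stands in for tf.unstack; BATCH and VOCAB_SIZE are hypothetical values picked only for illustration.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
VOCAB_SIZE, HIDDEN_SIZE, MAX_TOKENS, BATCH = 50, 64, 30, 4

rng = np.random.default_rng(0)
embedding = rng.uniform(-1.0, 1.0, size=(VOCAB_SIZE, HIDDEN_SIZE))

# Indices shaped as in the question: [batch, MAX_TOKENS, 1].
x_3d = rng.integers(0, VOCAB_SIZE, size=(BATCH, MAX_TOKENS, 1))
# Indices shaped as suggested in this answer: [batch, MAX_TOKENS].
x_2d = x_3d[:, :, 0]

# Integer-array indexing acts like an embedding lookup: the result shape is
# the index shape followed by the embedding size.
looked_up_3d = embedding[x_3d]   # (BATCH, MAX_TOKENS, 1, HIDDEN_SIZE)
looked_up_2d = embedding[x_2d]   # (BATCH, MAX_TOKENS, HIDDEN_SIZE)

# Stand-in for unstacking along axis 1: one array per time step.
steps_3d = [looked_up_3d[:, t] for t in range(MAX_TOKENS)]
steps_2d = [looked_up_2d[:, t] for t in range(MAX_TOKENS)]

print(steps_3d[0].shape)  # (4, 1, 64): rank 3, like the shape in the error
print(steps_2d[0].shape)  # (4, 64): rank 2, i.e. [batch_size, input_size]
```

With the 2-D index array, each of the MAX_TOKENS list elements has the [batch_size, input_size] shape that the documentation asks for.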

Answer 2:

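tf.squeeze removes dimensions of size 1 from a tensor. If the end goal is for each input to have shape [None, 64], add a line similar to inputs = tf.squeeze(inputs, axis=2) right after the embedding lookup and before tf.unstack; that removes the extra dimension and should fix your problem. Pass the axis explicitly: with no axis, tf.squeeze drops every dimension of size 1, which would also remove the batch dimension for a batch of one.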
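
The effect of the axis argument can be checked with a small sketch. It uses np.squeeze as a stand-in for tf.squeeze, on the assumption that the two treat the axis argument the same way here; the array of zeros is a hypothetical placeholder for the looked-up embeddings of a single-example batch.

```python
import numpy as np

HIDDEN_SIZE, MAX_TOKENS = 64, 30

# Stand-in for the embedding lookup output of a batch with one example:
# [batch, MAX_TOKENS, 1, HIDDEN_SIZE].
batch_of_one = np.zeros((1, MAX_TOKENS, 1, HIDDEN_SIZE))

# Without an axis, every size-1 dimension goes, including the batch axis.
squeezed_all = np.squeeze(batch_of_one)
# With an explicit axis, only the unwanted dimension is removed.
squeezed_axis = np.squeeze(batch_of_one, axis=2)

print(squeezed_all.shape)   # (30, 64)
print(squeezed_axis.shape)  # (1, 30, 64)
```

Only the second result can still be unstacked along axis 1 into per-step inputs of shape [batch_size, input_size].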