I've been trying to understand the sample code with https://www.tensorflow.org/tutorials/recurrent which you can find at https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py
(Using tensorflow 1.3.0.)
I've summarized (what I think are) the key parts, for my question, below:
size = 200
vocab_size = 10000
layers = 2
# input_.input_data is a 2D tensor [batch_size, num_steps] of
# word ids, from 1 to 10000
cell = tf.contrib.rnn.MultiRNNCell(
    [tf.contrib.rnn.BasicLSTMCell(size) for _ in range(layers)]
)
embedding = tf.get_variable(
    "embedding", [vocab_size, size], dtype=tf.float32)
inputs = tf.nn.embedding_lookup(embedding, input_.input_data)
inputs = tf.unstack(inputs, num=num_steps, axis=1)
outputs, state = tf.contrib.rnn.static_rnn(
    cell, inputs, initial_state=self._initial_state)
output = tf.reshape(tf.stack(axis=1, values=outputs), [-1, size])
softmax_w = tf.get_variable(
    "softmax_w", [size, vocab_size], dtype=data_type())
softmax_b = tf.get_variable("softmax_b", [vocab_size], dtype=data_type())
logits = tf.matmul(output, softmax_w) + softmax_b
# Then calculate loss, do gradient descent, etc.
My biggest question is how do I use the produced model to actually generate a next word suggestion, given the first few words of a sentence? Concretely, I imagine the flow is like this, but I cannot get my head around what the code for the commented lines would be:
prefix = ["What", "is", "your"]
state = #Zeroes
# Call static_rnn(cell) once for each word in prefix to initialize state
# Use final output to set a string, next_word
print(next_word)
My sub-questions are:
- Why use a random (uninitialized, untrained) word-embedding?
- Why use softmax?
- Does the hidden layer have to match the dimension of the input (i.e. the dimension of the word2vec embeddings)?
- How/Can I bring in a pre-trained word2vec model, instead of that uninitialized one?
(I'm asking them all as one question, as I suspect they are all connected, and connected to some gap in my understanding.)
What I was expecting to see here was loading an existing word2vec set of word embeddings (e.g. using gensim's KeyedVectors.load_word2vec_format()), converting each word in the input corpus to that representation when loading in each sentence, and then afterwards the LSTM would spit out a vector of the same dimension, and we would try to find the most similar word (e.g. using gensim's similar_by_vector(y, topn=1)).
Is using softmax saving us from the relatively slow similar_by_vector(y, topn=1) call?
BTW, for the pre-existing word2vec part of my question, "Using pre-trained word2vec with LSTM for word generation" is similar. However, the answers there, currently, are not what I'm looking for. What I'm hoping for is a plain English explanation that switches the light on for me, and plugs whatever the gap in my understanding is. "Use pre-trained word2vec in lstm language model?" is another similar question.
UPDATE: "Predicting next word using the language model tensorflow example" and "Predicting the next word using the LSTM ptb model tensorflow example" are similar questions. However, neither shows the code to actually take the first few words of a sentence and print out its prediction of the next word. I tried pasting in code from the 2nd question, and from https://stackoverflow.com/a/39282697/841830 (which comes with a github branch), but cannot get either to run without errors. I think they may be for an earlier version of TensorFlow?
ANOTHER UPDATE: Yet another question asking basically the same thing: "Predicting Next Word of LSTM Model from Tensorflow Example". It links to "Predicting next word using the language model tensorflow example" (and, again, the answers there are not quite what I am looking for).
In case it still isn't clear, what I am trying to do is write a high-level function called getNextWord(model, sentencePrefix), where model is a previously built LSTM that I've loaded from disk, and sentencePrefix is a string, such as "Open the", and it might return "pod". I then might call it with "Open the pod" and it will return "bay", and so on.
An example (with a character RNN, and using mxnet) is the sample() function shown near the end of https://github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter05_recurrent-neural-networks/simple-rnn.ipynb
You can call sample() during training, but you can also call it after training, and with any sentence you want.
Main Question
Loading words
Load custom data instead of using the test set:
test_data should contain word ids (print out word_to_id for a mapping). As an example, it should look like: [1, 52, 562, 246] ...
Displaying predictions
We need to return the output of the FC layer (logits) in the call to sess.run.
Later in the function, vals['top_word_id'] will have an array of integers with the ID of the top word. Look this up in word_to_id (by inverting the mapping, since it goes from word to id) to determine the predicted word. I did this a while ago with the small model, and the top-1 accuracy was pretty low (20-30% iirc), even though the perplexity was what was predicted in the header.
Subquestions
Why use a random (uninitialized, untrained) word-embedding?
You'd have to ask the authors, but in my opinion, training the embeddings makes this more of a standalone tutorial: instead of treating embedding as a black box, it shows how it works.
Why use softmax?
The final prediction is not determined by the cosine similarity to the output of the hidden layer. There is an FC layer after the LSTM that converts the LSTM output to a vector of scores over the vocabulary, which is trained against the one-hot encoding of the next word.
Here's a sketch of the operations and dimensions in the neural net:
Does the hidden layer have to match the dimension of the input?
Technically, no. If you look at the LSTM equations, you'll notice that x (the input) can be any size, as long as the weight matrix is adjusted appropriately.
How/Can I bring in a pre-trained word2vec model?
I don't know, sorry.
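On the hidden-size point above, a small sketch may help. It assumes the common LSTM formulation in which all four gates act on the concatenation of the input x and the previous hidden state h, so the input size only changes the number of rows of the weight matrix. The helper names lstm_kernel_shape and lstm_param_count are made up for this illustration; they are not part of TensorFlow.

```python
def lstm_kernel_shape(input_size, hidden_size):
    """Shape of one fused LSTM weight matrix acting on concat([x, h]).

    Assumption: the four gates (input, forget, cell candidate, output)
    are stored side by side in a single matrix.
    """
    rows = input_size + hidden_size   # x and h are concatenated
    cols = 4 * hidden_size            # one block of columns per gate
    return rows, cols


def lstm_param_count(input_size, hidden_size):
    """Number of weights plus biases for a single LSTM layer."""
    rows, cols = lstm_kernel_shape(input_size, hidden_size)
    return rows * cols + cols


# Same hidden size, different embedding sizes: only the row count changes.
print(lstm_kernel_shape(200, 200))  # (400, 800)
print(lstm_kernel_shape(300, 200))  # (500, 800)
```

The output dimension of the layer is hidden_size in both cases, which is why the embedding size and the hidden size can be chosen independently.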
You can find all the code at the end of the answer.
Most of your questions (why a softmax, how to use a pretrained embedding layer, etc.) have been answered, I reckon. However, as you were still waiting for concise code to generate text from a seed, here is how I ended up doing it myself.
I struggled, starting from the official Tensorflow tutorial, to get to the point where I could easily generate words from a produced model. Fortunately, after taking bits from practically all the answers you mentioned in your question, I got a better view of the problem (and solutions). This might contain errors, but at least it runs and generates some text...
I will wrap the next word suggestion in a loop, to generate a whole sentence, but you will easily reduce that to one word only.
Let's say you followed the current tutorial given by tensorflow (v1.4 at time of writing) here, which will save a model after training it.
Then what is left for us to do is to load it from disk, and to write a function which takes this model and some seed input and returns generated text.
Generate text from saved model
I assume we write all this code in a new python script. Whole script at the bottom as a recap, here I explain the main steps.
First necessary steps
Now, quite importantly, we create dictionaries to map ids to words and vice-versa (so we don't have to read a list of integers...).
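As a rough illustration of these two dictionaries, here is a minimal pure-Python sketch. It assumes ids are assigned by descending word frequency with ties broken alphabetically, and that the vocabulary contains an "<unk>" token for unknown words, as the PTB data does. The function names build_vocab, encode and decode are hypothetical and only used for this example.

```python
import collections


def build_vocab(words):
    """Map each distinct word to an integer id, most frequent first."""
    counts = collections.Counter(words)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    word_to_id = {word: i for i, (word, _) in enumerate(ordered)}
    id_to_word = {i: word for word, i in word_to_id.items()}
    return word_to_id, id_to_word


def encode(text, word_to_id, unk="<unk>"):
    """Turn a whitespace-separated string into a list of word ids."""
    return [word_to_id.get(w, word_to_id[unk]) for w in text.split()]


def decode(ids, id_to_word):
    """Turn a list of word ids back into a string."""
    return " ".join(id_to_word[i] for i in ids)


corpus = "the cat sat on the mat <unk> <eos>".split()
word_to_id, id_to_word = build_vocab(corpus)

seed_ids = encode("the dog sat", word_to_id)
print(seed_ids)                      # [0, 2, 6]
print(decode(seed_ids, id_to_word))  # the <unk> sat
```

Words that never appeared in the training data ("dog" here) fall back to the id of "<unk>", so the model never receives an id it has no embedding for.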
Then we load the configuration class, also setting num_steps and batch_size to 1, as we want to sample 1 word at a time while the LSTM also processes 1 word at a time. We also create the input instance on the fly:
Building graph
To load the saved model (as saved by the Supervisor.saver module in the tutorial), we first need to rebuild the graph (easy with the PTBModel class), which must use the same configuration as when trained:
Restoring saved weights:
... Sampling words from a given seed:
First we need the model to expose the logits outputs, or more precisely the probability distribution over the whole vocabulary. So in the ptb_lstm.py file add the line:
Then we can design some sampling function (you're free to use whatever you like here; the best approach is sampling with a temperature that tends to flatten or sharpen the distribution). Here is a basic random sampling method:
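A minimal sketch of temperature sampling, not necessarily the version used in this answer, could look like the following. It assumes probs is a plain Python list of probabilities for one time step (for example, one row of the softmax output converted with tolist()); the name sample_with_temperature is made up for this illustration.

```python
import math
import random


def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Draw an index from probs after rescaling by a temperature.

    temperature < 1 sharpens the distribution (closer to argmax),
    temperature > 1 flattens it (closer to uniform).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Work in log space; zero-probability entries stay impossible.
    scaled = [math.log(p) / temperature if p > 0 else float("-inf")
              for p in probs]
    top = max(scaled)
    if top == float("-inf"):
        raise ValueError("probs must contain at least one positive value")
    weights = [math.exp(s - top) for s in scaled]  # unnormalized
    threshold = rng.random() * sum(weights)
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        if threshold < cumulative:
            return index
    # Only reachable through floating-point rounding: fall back to the mode.
    return max(range(len(weights)), key=weights.__getitem__)


probs = [0.1, 0.7, 0.2]
print(sample_with_temperature(probs, temperature=1.0))   # random: 0, 1 or 2
print(sample_with_temperature(probs, temperature=0.01))  # practically always 1
```

With temperature=1.0 this samples from probs unchanged; lowering the temperature makes the output more repetitive but safer, raising it makes the output more varied but noisier.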
And finally a function that takes a seed, your model, and the dictionaries that map words to ids and vice versa as inputs, and outputs the generated string of text:
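The overall control flow of such a function can be sketched without TensorFlow. In the sketch below, step_fn is a hypothetical stand-in for one session.run call on a model built with num_steps and batch_size set to 1: it takes a word id and the current state, and returns the next-word probabilities plus the new state. The names generate_text and toy_step are assumptions made for this illustration.

```python
def generate_text(seed, step_fn, initial_state, word_to_id, id_to_word,
                  num_words=5, pick=None, unk="<unk>"):
    """Feed the seed word by word, then generate num_words more words."""
    if pick is None:
        # Default: greedy choice of the most probable word.
        pick = lambda probs: max(range(len(probs)), key=probs.__getitem__)
    seed_words = seed.split()
    if not seed_words:
        raise ValueError("seed must contain at least one word")

    state = initial_state
    probs = None
    # Warm-up: run the seed through the model to build up the state.
    for word in seed_words:
        word_id = word_to_id.get(word, word_to_id[unk])
        probs, state = step_fn(word_id, state)

    generated = []
    for _ in range(num_words):
        next_id = pick(probs)
        generated.append(id_to_word[next_id])
        # Feed the chosen word back in to predict the one after it.
        probs, state = step_fn(next_id, state)
    return " ".join(seed_words + generated)


# Toy stand-in for the model: always predicts the next id in the vocabulary.
vocab = ["<unk>", "open", "the", "pod", "bay", "doors"]
word_to_id = {w: i for i, w in enumerate(vocab)}
id_to_word = {i: w for w, i in word_to_id.items()}


def toy_step(word_id, state):
    probs = [0.0] * len(vocab)
    probs[(word_id + 1) % len(vocab)] = 1.0
    return probs, state + 1  # the "state" here just counts steps


print(generate_text("open the", toy_step, 0, word_to_id, id_to_word,
                    num_words=3))  # open the pod bay doors
```

Setting num_words=1 reduces this to a single next-word suggestion, and passing a sampling function as pick replaces the greedy choice with random sampling.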
TL;DR
Do not forget to add the line:
In the ptb_lstm.py file, in the __init__ definition of the PTBModel class, anywhere after the line logits = tf.reshape(logits, [self.batch_size, self.num_steps, vocab_size]).
The whole script, just run it from the same directory where you have reader.py and ptb_lstm.py:
Update
As for restoring old checkpoints (for me the model saved 6 months ago, not sure about exact TF version used then) with recent tensorflow (1.6 at least), it might raise an error about some variables not being found (see comment). In that case, you should update your checkpoints using this script.
Also, note that I had to modify this even further, as I noticed the saver.restore function was trying to read lstm_cell variables although my variables had been renamed to basic_lstm_cell, which also led to a NotFound Error. So an easy fix, just a small change in the checkpoint_convert.py script, lines 72-73, is to remove basic_ in the new names.
A convenient way to check the names of the variables contained in your checkpoints is (CKPT_FILE is the prefix that comes before .index, .data0000-1000, etc.):
This way you can verify that you have indeed the correct names (or the bad ones in the old checkpoint versions).
There are many questions; I will try to clarify some of them.
The key point here is, next word generation is actually word classification in the vocabulary. So you need a classifier, that is why there is a softmax in the output.
The principle is, at each time step, the model outputs the next word based on the last word embedding and the internal memory of previous words. tf.contrib.rnn.static_rnn automatically combines the input into the memory, but we need to provide the last word embedding and classify the next word.
We can use a pre-trained word2vec model: just init the embedding matrix with the pre-trained one. I think the tutorial uses a random matrix for the sake of simplicity. Memory size is not related to embedding size; you can use a larger memory size to retain more information.
These tutorials are high-level. If you want to deeply understand the details, I would suggest looking at the source code in plain python/numpy.
Before I explain my answer, first a remark about your suggestion to # Call static_rnn(cell) once for each word in prefix to initialize state: Keep in mind that static_rnn does not return a value like a numpy array, but a tensor. You can evaluate a tensor to a value when it is run (1) in a session (a session keeps the state of your computational graph, including the values of your model parameters) and (2) with the input that is necessary to calculate the tensor value. Input can be supplied using input readers (the approach in the tutorial), or using placeholders (what I will use below).
Now follows the actual answer: The model in the tutorial was designed to read input data from a file. The answer of @user3080953 already showed how to work with your own text file, but as I understand it you need more control over how the data is fed to the model. To do this you will need to define your own placeholders and feed the data to these placeholders when calling session.run().
In the code below I subclassed PTBModel and made it responsible for explicitly feeding data to the model. I introduced a special PTBInteractiveInput that has an interface similar to PTBInput so you can reuse the functionality in PTBModel. To train your model you still need PTBModel.
In the __init__ function of PTBModel you need to add this line:
First note that, although the embeddings are random in the beginning, they will be trained with the rest of the network. The embeddings you obtain after training will have similar properties to the embeddings you obtain with word2vec models, e.g., the ability to answer analogy questions with vector operations (king - man + woman = queen, etc.). In tasks where you have a considerable amount of training data, like language modelling (which does not need annotated training data) or neural machine translation, it is more common to train embeddings from scratch.
Softmax is a function that normalizes a vector of similarity scores (the logits) to a probability distribution. You need a probability distribution to train your model with cross-entropy loss and to be able to sample from the model. Note that if you are only interested in the most likely words of a trained model, you don't need the softmax and you can use the logits directly.
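The following pure-Python sketch illustrates both claims: softmax turns logits into a distribution that cross-entropy can be computed against, and the most likely word is the same whether you look at the logits or at the probabilities. The logits are made-up numbers, and the helper names are chosen for this example only.

```python
import math


def softmax(logits):
    """Normalize a list of scores to probabilities that sum to 1."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_id):
    """Negative log-probability assigned to the correct next word."""
    return -math.log(softmax(logits)[target_id])


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [2.0, 0.5, -1.0, 3.5]  # hypothetical scores for a 4-word vocabulary
probs = softmax(logits)

print(argmax(logits), argmax(probs))  # 3 3
# The loss is lower when the target is the word the model prefers.
print(cross_entropy(logits, 3) < cross_entropy(logits, 1))  # True
```

Softmax is monotonic, so it never changes the ranking of the words; it only matters when you need actual probabilities, for the loss or for sampling.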
No, in principle it can be any value. Using a hidden state with a lower dimension than your embedding dimension does not make much sense, however.
Here is a self-contained example of initializing an embedding with a given numpy array. If you want the embedding to remain fixed/constant during training, set trainable to False.