LSTM/RNN can be used for text generation. This shows way to use pre-trained GloVe word embeddings for Keras model.
- How to use pre-trained Word2Vec word embeddings with Keras LSTM model? This post did help.
- How to predict / generate next word when the model is provided with the sequence of words as its input?
Sample approach tried:
# Sample code to prepare word2vec word embeddings
import gensim
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
sentences = [[word for word in document.lower().split()] for document in documents]
word_model = gensim.models.Word2Vec(sentences, size=200, min_count = 1, window = 5)
# Code tried to prepare LSTM model for word generation
from keras.layers.recurrent import LSTM
from keras.layers.embeddings import Embedding
from keras.models import Model, Sequential
from keras.layers import Dense, Activation
embedding_layer = Embedding(input_dim=word_model.syn0.shape[0], output_dim=word_model.syn0.shape[1], weights=[word_model.syn0])
model = Sequential()
model.add(embedding_layer)
model.add(LSTM(word_model.syn0.shape[1]))
model.add(Dense(word_model.syn0.shape[0]))
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='mse')
Sample code / psuedocode to train LSTM and predict will be appreciated.
I've created a gist with a simple generator that builds on top of your initial idea: it's an LSTM network wired to the pre-trained word2vec embeddings, trained to predict the next word in a sentence. The data is the list of abstracts from arXiv website.
I'll highlight the most important parts here.
Gensim Word2Vec
Your code is fine, except for the number of iterations to train it. The default
iter=5
seems rather low. Besides, it's definitely not the bottleneck -- LSTM training takes much longer.iter=100
looks better.The result embedding matrix is saved into
pretrained_weights
array which has a shape(vocab_size, emdedding_size)
.Keras model
Your code is almost correct, except for the loss function. Since the model predicts the next word, it's a classification task, hence the loss should be
categorical_crossentropy
orsparse_categorical_crossentropy
. I've chosen the latter for efficiency reasons: this way it avoids one-hot encoding, which is pretty expensive for a big vocabulary.Note passing the pre-trained weights to
weights
.Data preparation
In order to work with
sparse_categorical_crossentropy
loss, both sentences and labels must be word indices. Short sentences must be padded with zeros to the common length.Sample generation
This is pretty straight-forward: the model outputs the vector of probabilities, of which the next word is sampled and appended to the input. Note that the generated text would be better and more diverse if the next word is sampled, rather than picked as
argmax
. The temperature based random sampling I've used is described here.Examples of generated text
Doesn't make too much sense, but is able to produce sentences that look at least grammatically sound (sometimes).
The link to the complete runnable script.