Word-level Seq2Seq with Keras


Question:

I was following the Keras Seq2Seq tutorial, and it works fine. However, this is a character-level model, and I would like to adapt it into a word-level model. The authors even include a paragraph with the required changes, but all my current attempts result in an error regarding wrong dimensions.

If you follow the character-level model, the input data is 3-dimensional: (#sequences, #max_seq_len, #num_chars), since each character is one-hot encoded. When I print the summary for the model as used in the tutorial, I get:

Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, None, 71)     0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None, 94)     0                                            
__________________________________________________________________________________________________
lstm_1 (LSTM)                   [(None, 256), (None, 335872      input_1[0][0]                    
__________________________________________________________________________________________________
lstm_2 (LSTM)                   [(None, None, 256),  359424      input_2[0][0]                    
                                                                 lstm_1[0][1]                     
                                                                 lstm_1[0][2]                     
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, None, 94)     24158       lstm_2[0][0]                     
==================================================================================================

This compiles and trains just fine.
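For reference, here is a minimal sketch of how such 3-dimensional one-hot input arrays can be built (the toy data and names are my own, not from the tutorial):

import numpy as np

# Toy example: 3 input texts over an alphabet of 4 characters.
texts = ["ab", "bca", "ca"]
char_index = {"a": 0, "b": 1, "c": 2, "d": 3}

num_sequences = len(texts)
max_seq_len = max(len(t) for t in texts)
num_chars = len(char_index)

# One slot per (sequence, timestep, character) -> 3 dims.
encoder_input_data = np.zeros((num_sequences, max_seq_len, num_chars), dtype="float32")
for i, text in enumerate(texts):
    for t, char in enumerate(text):
        encoder_input_data[i, t, char_index[char]] = 1.0

print(encoder_input_data.shape)  # (3, 3, 4)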

Now this tutorial has a section "What if I want to use a word-level model with integer sequences?", and I've tried to follow those changes. First, I encode all sequences using a word index. As such, the input and target data are now 2-dimensional: (#sequences, #max_seq_len), since I no longer one-hot encode but now use Embedding layers.

encoder_input_data_train.shape   =>  (90000, 9)
decoder_input_data_train.shape   =>  (90000, 16)
decoder_target_data_train.shape  =>  (90000, 16)

For example, a sequence might look like this:

[ 826.  288. 2961. 3127. 1260. 2108.    0.    0.    0.]
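A minimal sketch of how I produce these padded integer sequences (the word_index contents and the encode helper are my own, hypothetical):

import numpy as np

# Hypothetical word index; 0 is reserved for padding.
word_index = {"the": 826, "cat": 288, "sat": 2961, "on": 3127, "a": 1260, "mat": 2108}

def encode(words, max_seq_len):
    """Map each word to its index and zero-pad up to max_seq_len."""
    seq = np.zeros(max_seq_len, dtype="float32")
    for t, word in enumerate(words):
        seq[t] = word_index[word]
    return seq

print(encode(["the", "cat", "sat", "on", "a", "mat"], 9))
# [ 826.  288. 2961. 3127. 1260. 2108.    0.    0.    0.]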

When I use the listed code:

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense

# encoder: integer sequences -> embeddings -> final LSTM states
encoder_inputs = Input(shape=(None,))
x = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
x, state_h, state_c = LSTM(latent_dim, return_state=True)(x)
encoder_states = [state_h, state_c]

# decoder: integer sequences -> embeddings -> LSTM initialized with the encoder states
decoder_inputs = Input(shape=(None,))
x = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)
x = LSTM(latent_dim, return_sequences=True)(x, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(x)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

the model compiles and looks like this:

Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_35 (InputLayer)           (None, None)         0                                            
__________________________________________________________________________________________________
input_36 (InputLayer)           (None, None)         0                                            
__________________________________________________________________________________________________
embedding_32 (Embedding)        (None, None, 256)    914432      input_35[0][0]                   
__________________________________________________________________________________________________
embedding_33 (Embedding)        (None, None, 256)    914432      input_36[0][0]                   
__________________________________________________________________________________________________
lstm_32 (LSTM)                  [(None, 256), (None, 525312      embedding_32[0][0]               
__________________________________________________________________________________________________
lstm_33 (LSTM)                  (None, None, 256)    525312      embedding_33[0][0]               
                                                                 lstm_32[0][1]                    
                                                                 lstm_32[0][2]                    
__________________________________________________________________________________________________
dense_21 (Dense)                 (None, None, 3572)   918004      lstm_33[0][0]                    
==================================================================================================

While compiling works, training

model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=32, epochs=1, validation_split=0.2)

fails with the following error: ValueError: Error when checking target: expected dense_21 to have 3 dimensions, but got array with shape (90000, 16), with the latter being the shape of the decoder input/target. Why does the Dense layer get an array of the shape of the decoder input data?

Things I've tried:

  • I find it a bit strange that the decoder LSTM has return_sequences=True, since I thought I cannot feed sequences to a Dense layer (and the decoder of the original character-level model does not state this). However, simply removing it or setting return_sequences=False did not help. Of course, the Dense layer then has an output shape of (None, 3572).
  • I don't quite get the need for the Input layers. I've set them to shape=(max_input_seq_len,) and shape=(max_target_seq_len,) respectively, so that the summary shows the respective values, e.g., (None, 16), instead of (None, None). No change.
  • In the Keras docs I've read that an Embedding layer should be used with input_length, otherwise a Dense layer upstream cannot compute its outputs. But again, I still get errors when I set input_length accordingly (see the sketch after this list).
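For reference, the variants from the last two points looked roughly like this (max_input_seq_len and max_target_seq_len are my own variables):

# fixed-length Inputs instead of shape=(None,) -- the summary then shows e.g. (None, 16)
encoder_inputs = Input(shape=(max_input_seq_len,))
decoder_inputs = Input(shape=(max_target_seq_len,))

# Embedding with an explicit input_length, as suggested by the Keras docs
x = Embedding(num_decoder_tokens, latent_dim,
              input_length=max_target_seq_len)(decoder_inputs)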

I'm a bit at a deadlock here. Am I even on the right track, or am I missing something more fundamental? Is the shape of my data wrong? Why does the last Dense layer get an array with shape (90000, 16)? That seems rather off.

UPDATE: I figured out that the problem seems to be decoder_target_data, which currently has the shape (#samples, max_seq_len), e.g., (90000, 16). But I assume I need to one-hot encode the target output with respect to the vocabulary: (#samples, max_seq_len, vocab_size), e.g., (90000, 16, 3572).
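Concretely, I mean something like this (shown with tiny hypothetical numbers; the full array of shape (90000, 16, 3572) needs roughly 19 GiB as float32):

import numpy as np

# Tiny stand-in: 2 samples instead of 90000 integer-encoded, zero-padded targets.
target_sequences = np.array([[5, 2, 9, 0], [7, 1, 0, 0]])
num_samples, max_target_seq_len = target_sequences.shape
vocab_size = 3572

# One vocabulary-sized one-hot slot per (sample, timestep).
decoder_target_data = np.zeros(
    (num_samples, max_target_seq_len, vocab_size), dtype='float32')
for i, seq in enumerate(target_sequences):
    for t, token_id in enumerate(seq):
        decoder_target_data[i, t, int(token_id)] = 1.0

print(decoder_target_data.shape)  # (2, 4, 3572)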

Unfortunately, this throws a MemoryError. However, when I assume, for debugging purposes, a vocabulary size of 10:

decoder_target_data = np.zeros((len(input_sequences), max_target_seq_len, 10), dtype='float32')

and later in the decoder model:

x = Dense(10, activation='softmax')(x)

then the model trains without error. In case that's indeed my issue, I would have to train the model with manually generated batches, so I can keep the vocabulary size but reduce the number of samples per batch, e.g., 90 batches, each of shape (1000, 16, 3572). Am I on the right track here?
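Concretely, I have in mind something like this generator sketch (using fit_generator; the names are my own):

import numpy as np

def batch_generator(encoder_input, decoder_input, decoder_target_int,
                    vocab_size, batch_size=1000):
    """Yield ([encoder, decoder] inputs, one-hot targets) so that only one
    (batch_size, max_seq_len, vocab_size) array is in memory at a time."""
    num_samples, max_seq_len = decoder_target_int.shape
    while True:  # fit_generator expects an endless generator
        for start in range(0, num_samples, batch_size):
            end = start + batch_size
            chunk = decoder_target_int[start:end]
            targets = np.zeros((len(chunk), max_seq_len, vocab_size), dtype='float32')
            for i, seq in enumerate(chunk):
                for t, token_id in enumerate(seq):
                    targets[i, t, int(token_id)] = 1.0
            yield [encoder_input[start:end], decoder_input[start:end]], targets

# 90000 samples / batch_size 1000 -> 90 steps per epoch
model.fit_generator(
    batch_generator(encoder_input_data, decoder_input_data,
                    decoder_target_data, num_decoder_tokens),
    steps_per_epoch=90, epochs=1)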