Word-level Seq2Seq with Keras


Question:

I was following the Keras Seq2Seq tutorial, and it works fine. However, this is a character-level model, and I would like to adapt it into a word-level model. The authors even include a paragraph with the required changes, but all my current attempts result in an error regarding wrong dimensions.

If you follow the character-level model, the input data is 3-dimensional: (#sequences, #max_seq_len, #num_chars), since each character is one-hot encoded. When I print the summary for the model as used in the tutorial, I get:

Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, None, 71)     0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None, 94)     0                                            
__________________________________________________________________________________________________
lstm_1 (LSTM)                   [(None, 256), (None, 335872      input_1[0][0]                    
__________________________________________________________________________________________________
lstm_2 (LSTM)                   [(None, None, 256),  359424      input_2[0][0]                    
                                                                 lstm_1[0][1]                     
                                                                 lstm_1[0][2]                     
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, None, 94)     24158       lstm_2[0][0]                     
==================================================================================================

This compiles and trains just fine.
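For reference, here is a minimal sketch of how such 3-dimensional one-hot input arrays can be built (the toy data and names are my own, not from the tutorial):

import numpy as np

# Toy example: 3 input texts over an alphabet of 4 characters.
texts = ["ab", "bca", "ca"]
char_index = {"a": 0, "b": 1, "c": 2, "d": 3}

num_sequences = len(texts)
max_seq_len = max(len(t) for t in texts)
num_chars = len(char_index)

# One slot per (sequence, timestep, character) -> 3 dims.
encoder_input_data = np.zeros((num_sequences, max_seq_len, num_chars), dtype="float32")
for i, text in enumerate(texts):
    for t, char in enumerate(text):
        encoder_input_data[i, t, char_index[char]] = 1.0

print(encoder_input_data.shape)  # (3, 3, 4)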

Now this tutorial has a section "What if I want to use a word-level model with integer sequences?", and I've tried to follow those changes. First, I encode all sequences using a word index. As such, the input and target data are now 2-dimensional: (#sequences, #max_seq_len), since I no longer one-hot encode but now use Embedding layers.

encoder_input_data_train.shape   =>  (90000, 9)
decoder_input_data_train.shape   =>  (90000, 16)
decoder_target_data_train.shape  =>  (90000, 16)

For example, a sequence might look like this:

[ 826.  288. 2961. 3127. 1260. 2108.    0.    0.    0.]
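A minimal sketch of how I produce these padded integer sequences (the word_index contents and the encode helper are my own, hypothetical):

import numpy as np

# Hypothetical word index; 0 is reserved for padding.
word_index = {"the": 826, "cat": 288, "sat": 2961, "on": 3127, "a": 1260, "mat": 2108}

def encode(words, max_seq_len):
    """Map each word to its index and zero-pad up to max_seq_len."""
    seq = np.zeros(max_seq_len, dtype="float32")
    for t, word in enumerate(words):
        seq[t] = word_index[word]
    return seq

print(encode(["the", "cat", "sat", "on", "a", "mat"], 9))
# [ 826.  288. 2961. 3127. 1260. 2108.    0.    0.    0.]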

When I use the listed code:

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense

# encoder: integer sequences -> embeddings -> final LSTM states
encoder_inputs = Input(shape=(None,))
x = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
x, state_h, state_c = LSTM(latent_dim, return_state=True)(x)
encoder_states = [state_h, state_c]

# decoder: integer sequences -> embeddings -> LSTM initialized with the encoder states
decoder_inputs = Input(shape=(None,))
x = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)
x = LSTM(latent_dim, return_sequences=True)(x, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(x)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

the model compiles and looks like this:

Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_35 (InputLayer)           (None, None)         0                                            
__________________________________________________________________________________________________
input_36 (InputLayer)           (None, None)         0                                            
__________________________________________________________________________________________________
embedding_32 (Embedding)        (None, None, 256)    914432      input_35[0][0]                   
__________________________________________________________________________________________________
embedding_33 (Embedding)        (None, None, 256)    914432      input_36[0][0]                   
__________________________________________________________________________________________________
lstm_32 (LSTM)                  [(None, 256), (None, 525312      embedding_32[0][0]               
__________________________________________________________________________________________________
lstm_33 (LSTM)                  (None, None, 256)    525312      embedding_33[0][0]               
                                                                 lstm_32[0][1]                    
                                                                 lstm_32[0][2]                    
__________________________________________________________________________________________________
dense_21 (Dense)                 (None, None, 3572)   918004      lstm_33[0][0]                    
==================================================================================================

While compiling works, training

model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=32, epochs=1, validation_split=0.2)

fails with the following error: ValueError: Error when checking target: expected dense_21 to have 3 dimensions, but got array with shape (90000, 16), with the latter being the shape of the decoder input/target. Why does the Dense layer get an array of the shape of the decoder input data?

Things I've tried:

  • I find it a bit strange that the decoder LSTM has return_sequences=True, since I thought I cannot feed sequences to a Dense layer (and the decoder of the original character-level model does not state this). However, simply removing it or setting return_sequences=False did not help. Of course, the Dense layer then has an output shape of (None, 3572).
  • I don't quite get the need for the Input layers. I've set them to shape=(max_input_seq_len,) and shape=(max_target_seq_len,) respectively, so that the summary shows the respective values, e.g., (None, 16), instead of (None, None). No change.
  • In the Keras docs I've read that an Embedding layer should be used with input_length, otherwise a Dense layer upstream cannot compute its outputs. But again, I still get errors when I set input_length accordingly (see the sketch after this list).
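For reference, the variants from the last two points looked roughly like this (max_input_seq_len and max_target_seq_len are my own variables):

# fixed-length Inputs instead of shape=(None,) -- the summary then shows e.g. (None, 16)
encoder_inputs = Input(shape=(max_input_seq_len,))
decoder_inputs = Input(shape=(max_target_seq_len,))

# Embedding with an explicit input_length, as suggested by the Keras docs
x = Embedding(num_decoder_tokens, latent_dim,
              input_length=max_target_seq_len)(decoder_inputs)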

I'm a bit at a deadlock here. Am I even on the right track, or am I missing something more fundamental? Is the shape of my data wrong? Why does the last Dense layer get an array with shape (90000, 16)? That seems rather off.

UPDATE: I figured out that the problem seems to be decoder_target_data, which currently has the shape (#samples, max_seq_len), e.g., (90000, 16). But I assume I need to one-hot encode the target output with respect to the vocabulary: (#samples, max_seq_len, vocab_size), e.g., (90000, 16, 3572).
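Concretely, I mean something like this (shown with tiny hypothetical numbers; the full array of shape (90000, 16, 3572) needs roughly 19 GiB as float32):

import numpy as np

# Tiny stand-in: 2 samples instead of 90000 integer-encoded, zero-padded targets.
target_sequences = np.array([[5, 2, 9, 0], [7, 1, 0, 0]])
num_samples, max_target_seq_len = target_sequences.shape
vocab_size = 3572

# One vocabulary-sized one-hot slot per (sample, timestep).
decoder_target_data = np.zeros(
    (num_samples, max_target_seq_len, vocab_size), dtype='float32')
for i, seq in enumerate(target_sequences):
    for t, token_id in enumerate(seq):
        decoder_target_data[i, t, int(token_id)] = 1.0

print(decoder_target_data.shape)  # (2, 4, 3572)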

Unfortunately, this throws a MemoryError. However, when I assume, for debugging purposes, a vocabulary size of 10:

decoder_target_data = np.zeros((len(input_sequences), max_target_seq_len, 10), dtype='float32')

and later in the decoder model:

x = Dense(10, activation='softmax')(x)

then the model trains without error. In case that's indeed my issue, I would have to train the model with manually generated batches, so I can keep the vocabulary size but reduce the number of samples per batch, e.g., 90 batches, each of shape (1000, 16, 3572). Am I on the right track here?
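Concretely, I have in mind something like this generator sketch (using fit_generator; the names are my own):

import numpy as np

def batch_generator(encoder_input, decoder_input, decoder_target_int,
                    vocab_size, batch_size=1000):
    """Yield ([encoder, decoder] inputs, one-hot targets) so that only one
    (batch_size, max_seq_len, vocab_size) array is in memory at a time."""
    num_samples, max_seq_len = decoder_target_int.shape
    while True:  # fit_generator expects an endless generator
        for start in range(0, num_samples, batch_size):
            end = start + batch_size
            chunk = decoder_target_int[start:end]
            targets = np.zeros((len(chunk), max_seq_len, vocab_size), dtype='float32')
            for i, seq in enumerate(chunk):
                for t, token_id in enumerate(seq):
                    targets[i, t, int(token_id)] = 1.0
            yield [encoder_input[start:end], decoder_input[start:end]], targets

# 90000 samples / batch_size 1000 -> 90 steps per epoch
model.fit_generator(
    batch_generator(encoder_input_data, decoder_input_data,
                    decoder_target_data, num_decoder_tokens),
    steps_per_epoch=90, epochs=1)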