Deep autoencoder in Keras converting one dimension

Posted 2019-08-02 17:31

Question:

I am doing an image captioning task, using vectors to represent both images and captions.

The caption vectors have a length/dimension of 128. The image vectors have a length/dimension of 2048.

What I want to do is train an autoencoder to get an encoder that can convert a text vector into an image vector, and a decoder that can convert an image vector back into a text vector.

Encoder: 128 -> 2048.

Decoder: 2048 -> 128.

I followed this tutorial to implement a shallow network that does what I want.

But I can't figure out how to create a deep network following the same tutorial.

from keras.layers import Input, Dense
from keras.models import Model

x_dim = 128   # caption vector size
y_dim = 2048  # image vector size
x_dim_shape = Input(shape=(x_dim,))
encoded = Dense(512, activation='relu')(x_dim_shape)
encoded = Dense(1024, activation='relu')(encoded)
encoded = Dense(y_dim, activation='relu')(encoded)

decoded = Dense(1024, activation='relu')(encoded)
decoded = Dense(512, activation='relu')(decoded)
decoded = Dense(x_dim, activation='sigmoid')(decoded)

# this model maps an input to its reconstruction
autoencoder = Model(input=x_dim_shape, output=decoded)

# this model maps an input to its encoded representation
encoder = Model(input=x_dim_shape, output=encoded)

encoded_input = Input(shape=(y_dim,))
decoder_layer1 = autoencoder.layers[-3]
decoder_layer2 = autoencoder.layers[-2]
decoder_layer3 = autoencoder.layers[-1]

# create the decoder model
decoder = Model(input=encoded_input, output=decoder_layer3(decoder_layer2(decoder_layer1(encoded_input))))

autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')


autoencoder.fit(training_data_x, training_data_y,
                nb_epoch=50,
                batch_size=256,
                shuffle=True,
                validation_data=(test_data_x, test_data_y))

The training_data_x and test_data_x have 128 dimensions. The training_data_y and test_data_y have 2048 dimensions.

The error I receive while trying to run this is the following:

Exception: Error when checking model target: expected dense_6 to have shape (None, 128) but got array with shape (32360, 2048)

dense_6 is the last Dense layer (the final decoded output).

Answer 1:

Autoencoders

If what you want is to be able to call the encoder and decoder separately, you need to train the whole autoencoder exactly as in the tutorial, with input_shape == output_shape (== 128 in your case), and only then can you call a subset of the layers:

from keras.layers import Input, Dense
from keras.models import Model

x_dim = 128
y_dim = 2048
x_dim_shape = Input(shape=(x_dim,))
encoded = Dense(512, activation='relu')(x_dim_shape)
encoded = Dense(1024, activation='relu')(encoded)
encoded = Dense(y_dim, activation='relu')(encoded)

decoded = Dense(1024, activation='relu')(encoded)
decoded = Dense(512, activation='relu')(decoded)
decoded = Dense(x_dim, activation='sigmoid')(decoded)

# this model maps an input to its reconstruction
autoencoder = Model(input=x_dim_shape, output=decoded)
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
autoencoder.fit(training_data_x, training_data_x, nb_epoch=50, batch_size=256, shuffle=True, validation_data=(test_data_x, test_data_x))

# test the decoder model
encoded_input = Input(shape=(y_dim,))
decoder_layer1 = autoencoder.layers[-3]
decoder_layer2 = autoencoder.layers[-2]
decoder_layer3 = autoencoder.layers[-1]

decoder = Model(input=encoded_input, output=decoder_layer3(decoder_layer2(decoder_layer1(encoded_input))))
decoder.compile(optimizer='adadelta', loss='binary_crossentropy')
score = decoder.evaluate(test_data_y, test_data_x)
print('Decoder evaluation: {:.2f}'.format(score))

Notice that, when calling autoencoder.fit(), x == y in the arguments. That is how an autoencoder is (normally) trained: it optimizes the bottleneck representation (what you call y in your own code) so that the original input can be reconstructed from fewer dimensions.
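
For completeness, a minimal sketch of how the standalone encoder could be extracted after training, reusing the x_dim_shape input and encoded tensor defined in the snippet above (same idea as the decoder construction):

# the encoder shares the trained weights of the first three Dense layers
encoder = Model(input=x_dim_shape, output=encoded)

# maps 128-dimensional caption vectors into the 2048-dimensional space
encoded_vectors = encoder.predict(test_data_x)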

But, as a transition to the second part of this answer, notice that in your case, x_dim < y_dim. You are actually training a model to increase the data dimensionality, which doesn't make much sense, AFAICT.

Your problem

Now, reading your question again, I don't think autoencoders are a good fit for what you want to achieve. They are designed to reduce the dimensionality of the data with minimal loss of information.
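
To make the contrast concrete, here is a minimal sketch of what a conventional bottleneck autoencoder looks like (the 32-dimensional bottleneck is just an illustrative choice, not something from your setup):

from keras.layers import Input, Dense
from keras.models import Model

# a conventional autoencoder compresses its input, rather than expanding it
inputs = Input(shape=(128,))
bottleneck = Dense(32, activation='relu')(inputs)             # 128 -> 32: compression
reconstructed = Dense(128, activation='sigmoid')(bottleneck)  # 32 -> 128: reconstruction
conventional_autoencoder = Model(input=inputs, output=reconstructed)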

What you are trying to do is:

  1. Render text onto an image (what you call encoding)
  2. Read text from an image (what you call decoding)

In my understanding, while 2 might indeed require some machine learning, 1 definitely doesn't: there are plenty of libraries out there for writing text on images.
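
For instance, a minimal sketch with Pillow (one such library; the file names are just placeholders):

from PIL import Image, ImageDraw

# draw a caption onto an existing image using the default font
img = Image.open('photo.jpg')
draw = ImageDraw.Draw(img)
draw.text((10, 10), 'a caption for this image', fill='white')
img.save('photo_with_caption.jpg')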