ValueError: No gradients provided for any variable

I am trying to implement a simple sequence-to-sequence model using Keras. However, I keep seeing the following ValueError:

ValueError: No gradients provided for any variable: ['simple_model/time_distributed/kernel:0', 'simple_model/time_distributed/bias:0', 'simple_model/embedding/embeddings:0', 'simple_model/conv2d/kernel:0', 'simple_model/conv2d/bias:0', 'simple_model/dense_1/kernel:0', 'simple_model/dense_1/bias:0'].

Similar questions, and a related issue on GitHub, suggest that this might have something to do with the cross-entropy loss function, but I fail to see what I am doing wrong here.

I do not think that this is the problem, but I want to mention that I am on a nightly build of TensorFlow, tf-nightly==2.2.0.dev20200410 to be precise.

The following code is a standalone example and should reproduce the exception above:

import random
from functools import partial

import tensorflow as tf
from tensorflow import keras
from tensorflow_datasets.core.features.text import SubwordTextEncoder

EOS = '<eos>'
PAD = '<pad>'

RESERVED_TOKENS = [EOS, PAD]
EOS_ID = RESERVED_TOKENS.index(EOS)
PAD_ID = RESERVED_TOKENS.index(PAD)

dictionary = [
    'verstehen',
    'verstanden',
    'vergessen',
    'verlegen',
    'verlernen',
    'vertun',
    'vertan',
    'verloren',
    'verlieren',
    'verlassen',
    'verhandeln',
]

dictionary = [word.lower() for word in dictionary]


class SimpleModel(keras.models.Model):

    def __init__(self, params, *args, **kwargs):
        super().__init__(*args, **kwargs)

        self.params = params
        self.out_layer = keras.layers.Dense(1, activation='softmax')

        self.model_layers = [
            keras.layers.Embedding(params['vocab_size'], params['vocab_size']),
            keras.layers.Lambda(lambda l: tf.expand_dims(l, -1)),
            keras.layers.Conv2D(1, 4),
            keras.layers.MaxPooling2D(1),
            keras.layers.Dense(1, activation='relu'),
            keras.layers.TimeDistributed(self.out_layer)
        ]

    def call(self, example, training=None, mask=None):
        x = example['inputs']
        for layer in self.model_layers:
            x = layer(x)
        return x


def sample_generator(text_encoder: SubwordTextEncoder, max_sample: int = None):
    count = 0

    while True:
        random.shuffle(dictionary)

        for word in dictionary:

            for i in range(1, len(word)):

                inputs = word[:i]
                targets = word

                example = dict(
                    inputs=text_encoder.encode(inputs) + [EOS_ID],
                    targets=text_encoder.encode(targets) + [EOS_ID],
                )
                count += 1

                yield example

                if max_sample is not None and count >= max_sample:
                    print('Reached max_samples (%d)' % max_sample)
                    return


def make_dataset(generator_fn, params, training):

    dataset = tf.data.Dataset.from_generator(
        generator_fn,
        output_types={
            'inputs': tf.int64,
            'targets': tf.int64,
        }
    ).padded_batch(
        params['batch_size'],
        padded_shapes={
            'inputs': (None,),
            'targets': (None,)
        },
    )

    if training:
        dataset = dataset.map(partial(prepare_example, params=params)).repeat()

    return dataset


def prepare_example(example: dict, params: dict):
    # Make sure targets are one-hot encoded
    example['targets'] = tf.one_hot(example['targets'], depth=params['vocab_size'])
    return example


def main():

    text_encoder = SubwordTextEncoder.build_from_corpus(
        iter(dictionary),
        target_vocab_size=1000,
        max_subword_length=6,
        reserved_tokens=RESERVED_TOKENS
    )

    generator_fn = partial(sample_generator, text_encoder=text_encoder, max_sample=10)

    params = dict(
        batch_size=20,
        vocab_size=text_encoder.vocab_size,
        hidden_size=32,
        max_input_length=30,
        max_target_length=30
    )

    model = SimpleModel(params)

    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
    )

    train_dataset = make_dataset(generator_fn, params, training=True)
    dev_dataset = make_dataset(generator_fn, params, training=False)

    # Peek data
    for train_batch, dev_batch in zip(train_dataset, dev_dataset):
        print(train_batch)
        print(dev_batch)
        break

    model.fit(
        train_dataset,
        epochs=1000,
        steps_per_epoch=100,
        validation_data=dev_dataset,
        validation_steps=100,
    )


if __name__ == '__main__':
    main()

Update

Gist link
Github issue link

There are two different sets of problems in your code, which can be categorized as syntactical and architectural. The error raised (i.e. No gradients provided for any variable) comes from the syntactical problems, which I will mostly address below; after that I will also give you some pointers about the architectural problems.

The main cause of the syntactical problems is the use of named inputs and outputs for the model. Named inputs and outputs in Keras are mostly useful when the model has multiple input and/or output layers. Your model, however, has only one input and one output layer, so naming them may not be very useful here; but if that's your decision, I will explain how to do it properly.

First of all, keep in mind that when using Keras models, the data generated by any input pipeline (whether a Python generator or a tf.data.Dataset) should be provided as a tuple, i.e. (input_batch, output_batch) or (input_batch, output_batch, sample_weights). This is the expected format everywhere in Keras when dealing with input pipelines, even when named inputs and outputs are passed as dictionaries.
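To make the connection to the error more concrete, here is a rough pure-Python sketch of how a batch gets split into inputs, targets and sample weights. It is only my simplified approximation of what Keras does internally, and split_batch is a hypothetical name, not a Keras function. The point is that a bare dict is not a tuple, so everything in it is treated as input and the targets end up being None, which leaves the loss with nothing to differentiate:

```python
def split_batch(data):
    """Simplified stand-in for how one batch is split into (x, y, sample_weight)."""
    if not isinstance(data, tuple):
        # A bare dict (or tensor) is treated entirely as model input: no targets.
        return data, None, None
    if len(data) == 1:
        return data[0], None, None
    if len(data) == 2:
        return data[0], data[1], None
    if len(data) == 3:
        return data[0], data[1], data[2]
    raise ValueError('Expected a tuple of 1 to 3 elements, got %d' % len(data))


# What the original generator yields: one dict holding both inputs and targets.
x, y, w = split_batch({'inputs': [5, 6, 0], 'targets': [5, 6, 7, 0]})
print(y)  # None

# A tuple of dicts: the second element is picked up as the targets.
x, y, w = split_batch(({'inputs': [5, 6, 0]}, {'targets': [5, 6, 7, 0]}))
print(y)  # {'targets': [5, 6, 7, 0]}
```

The token ids in the example are made up for illustration only.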

For example, if I want to use input/output naming and my model has two input layers named "words" and "importance" and two output layers named "output1" and "output2", the data should be formatted like this:

({'words': words_data, 'importance': importance_data},
 {'output1': output1_data, 'output2': output2_data})

So as you can see above, it's a tuple where each element is a dictionary; the first element corresponds to the inputs of the model and the second element corresponds to its outputs. With this in mind, let's see what modifications should be made to your code:

In sample_generator we should yield a tuple of dicts, not a single dict. So:

example = (
    {'inputs': text_encoder.encode(inputs) + [EOS_ID]},
    {'targets': text_encoder.encode(targets) + [EOS_ID]},
)

In the make_dataset function, the output_types and padded_shapes arguments should follow the same structure:

output_types=(
    {'inputs': tf.int64},
    {'targets': tf.int64}
)

padded_shapes=(
    {'inputs': (None,)},
    {'targets': (None,)}
)

The signature of prepare_example and its body should be modified as well:

def prepare_example(ex_inputs: dict, ex_outputs: dict, params: dict):
    # Make sure targets are one-hot encoded
    ex_outputs['targets'] = tf.one_hot(ex_outputs['targets'], depth=params['vocab_size'])
    return ex_inputs, ex_outputs

And finally, the call method of the subclassed model should return a dict keyed by the output name:
```
return {'targets': x}
```
And one more thing: normally we should also put these names on the corresponding input and output layers using the name argument when constructing the layers (like Dense(..., name='output')); however, since we are using Model sub-classing here to define our model, that's not necessary.

All right, these changes resolve the input/output problems and the error related to gradients goes away; however, if you run the code after applying the above modifications, you will still get an error regarding incompatible shapes. As I said earlier, there are architectural issues in your model, which I will briefly address below.

As you mentioned, this is supposed to be a seq-to-seq model. Therefore, the output is a sequence of one-hot encoded vectors, where the length of each vector is equal to the vocabulary size of the target sequences. As a result, the softmax classifier should have as many units as the vocabulary size, like this (Note: never, in any model or problem, use a softmax layer with only one unit; that's all wrong! Think about why it's wrong!):

self.out_layer = keras.layers.Dense(params['vocab_size'], activation='softmax')
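If you want to check the one-unit case for yourself, here is a tiny pure-Python sketch (no Keras involved; the softmax helper is written by hand just for illustration). Softmax normalizes over the units of the layer, so with a single unit the output is 1 no matter what the input is, and the layer cannot express anything:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# One unit: the result does not depend on the logit at all.
for logit in (-3.0, 0.0, 42.0):
    print(softmax([logit]))  # [1.0] every time

# Several units: a proper distribution over the classes.
print(softmax([1.0, 2.0, 3.0]))
```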

The next thing to consider is the fact that we are dealing with 1D sequences (i.e. a sequence of tokens/words). Therefore, using 2D-convolution and 2D-pooling layers does not make sense here. You can either use their 1D counterparts or replace them with something else, like RNN layers. As a result, the Lambda layer should be removed as well. Also, if you want to use convolution and pooling, you should properly adjust the number of filters in each layer as well as the pool size (i.e. a single conv filter, Conv1D(1, ...), is probably not optimal, and a pool size of 1 does not make sense).

Further, that Dense layer before the last layer which has only one unit could severely limit the representational capacity of the model (i.e. it is essentially the bottleneck of your model). Either increase its number of units, or remove it.

The other thing is that there is no reason not to one-hot encode the labels of the dev set. Rather, they should be one-hot encoded like the labels of the training set. Therefore, either the training argument of make_dataset should be removed entirely or, if you have some other use case for it, the dev dataset should be created by passing training=True to the make_dataset function.

Finally, after all these changes your model might work and start fitting on the data; but after a few batches, you might get an incompatible shapes error again. That's because you are generating input data with an unknown dimension and also using a relaxed padding approach that pads each batch only as much as needed (i.e. by using (None,) for padded_shapes). To resolve this, you should decide on a fixed input/output dimension (e.g. by considering a fixed length for input/output sequences), and then adjust the architecture or hyper-parameters of the model (e.g. conv kernel size, conv padding, pooling size, adding more layers, etc.) as well as the padded_shapes argument accordingly. If you would instead like your model to support input/output sequences of variable length, you should account for that in the model's architecture and hyper-parameters and also in the padded_shapes argument. Since the solution depends on the task and the design you have in mind, and there is no one-size-fits-all solution, I will not comment further on that and leave it to you to figure out. But here is a working solution (which may not be, and probably isn't, optimal at all) just to give you an idea:

self.out_layer = keras.layers.Dense(params['vocab_size'], activation='softmax')

self.model_layers = [
    keras.layers.Embedding(params['vocab_size'], params['vocab_size']),
    keras.layers.Conv1D(32, 4, padding='same'),
    keras.layers.TimeDistributed(self.out_layer)
]


# ...
padded_shapes=(
    {'inputs': (10,)},
    {'targets': (10,)}
)
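As a rough sanity check on why padding='same' together with a fixed length of 10 lines up the shapes, here is a small sketch of the usual output-length arithmetic for a 1D convolution. The helper conv1d_output_length is a hypothetical name I made up for this illustration, not a Keras function:

```python
import math


def conv1d_output_length(length, kernel_size, padding, stride=1):
    """Number of time steps a 1D convolution produces for a given input length."""
    if padding == 'same':
        # The input is padded so that (with stride 1) the length is preserved.
        return math.ceil(length / stride)
    if padding == 'valid':
        # No padding: the kernel must fit entirely inside the sequence.
        return (length - kernel_size) // stride + 1
    raise ValueError('Unsupported padding: %r' % padding)


seq_len = 10  # fixed length from padded_shapes above
print(conv1d_output_length(seq_len, 4, 'valid'))  # 7
print(conv1d_output_length(seq_len, 4, 'same'))   # 10
```

With 'valid' padding the model would emit 7 time steps against 10 target steps, so the loss could not be computed; with 'same' both sides have 10 steps.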