Varying sequence length in Keras without padding

Published 2019-04-06 20:56

Question:

I have a question regarding varying sequence lengths for LSTMs in Keras. I'm passing batches of size 200, containing sequences of variable length (= x) with 100 features for each object in the sequence (=> [200, x, 100]), into an LSTM:

LSTM(100, return_sequences=True, stateful=True, input_shape=(None, 100), batch_input_shape=(200, None, 100))

I'm fitting the model on the following randomly created matrices:

x_train = np.random.random((1000, 50, 100))
x_train_2 = np.random.random((1000, 10,100))

As far as I understand LSTMs (and the Keras implementation), x should refer to the number of LSTM cells. For each LSTM cell, a state and three matrices have to be learned (for the input, state and output of the cell). How is it possible to pass varying sequence lengths into the LSTM, as I did, without padding up to a specified maximum length? The code runs, but in my understanding it actually shouldn't. It's even possible to pass another x_train_3 with a sequence length of 60 afterwards, but there shouldn't be states and matrices for the extra 10 cells.

By the way, I'm using Keras version 1.0.8 and TensorFlow GPU 0.9.

Here is my example code:

from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np
from keras import backend as K

with K.get_session():

    # create model
    model = Sequential()
    model.add(LSTM(100, return_sequences=True, stateful=True, input_shape=(None, 100),
             batch_input_shape=(200, None, 100)))
    model.add(LSTM(100))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])

    # Generate dummy training data
    x_train = np.random.random((1000, 50, 100))
    x_train_2 = np.random.random((1000, 10, 100))
    y_train = np.random.random((1000, 2))
    y_train_2 = np.random.random((1000, 2))

    # Generate dummy validation data
    x_val = np.random.random((200, 50, 100))
    y_val = np.random.random((200, 2))

    # fit and eval models
    model.fit(x_train, y_train, batch_size=200, nb_epoch=1, shuffle=False, validation_data=(x_val, y_val), verbose=1)
    model.fit(x_train_2, y_train_2, batch_size=200, nb_epoch=1, shuffle=False, validation_data=(x_val, y_val), verbose=1)
    score = model.evaluate(x_val, y_val, batch_size=200, verbose=1)

Answer 1:

First: it doesn't seem you need stateful=True and batch_input_shape. These are intended for when you want to divide a very long sequence (or sequences) into parts and train each part separately, without the model thinking that the sequence has come to an end.

When you use stateful layers, you have to reset/erase the states/memory manually when you decide that a certain batch is the last part of the long sequence(s).

You seem to be working with entire sequences. No stateful is needed.

Padding is not strictly necessary; you can use padding + masking to ignore the additional steps (sketched below). If you don't want to use padding, you can split your data into smaller batches, each batch with a single, distinct sequence length. See this: stackoverflow.com/questions/46144191
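For illustration, here is a minimal sketch of the padding + masking route (the data, layer sizes and mask value are made up to match the question; adapt them to your setup):

import numpy as np
from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense

# hypothetical data: two sequences of different lengths, 100 features per step
seqs = [np.random.random((50, 100)), np.random.random((10, 100))]

# zero-pad every sequence up to the longest one -> shape (2, 50, 100)
max_len = max(s.shape[0] for s in seqs)
x = np.zeros((len(seqs), max_len, 100))
for i, s in enumerate(seqs):
    x[i, :s.shape[0]] = s

model = Sequential()
# Masking makes the following layers skip time steps whose values all equal mask_value
model.add(Masking(mask_value=0.0, input_shape=(None, 100)))
model.add(LSTM(100))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')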

The sequence length (time steps) does not change the number of cells/units or the weights. It's possible to train using different lengths. The dimension that cannot change is the number of features.


Input dimensions:

The input dimensions are (NumberOfSequences, Length, Features).
There is absolutely no relation between the input shape and the number of cells. The input shape only carries the number of steps (recursions), which is the Length dimension.
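As a concrete illustration of the (NumberOfSequences, Length, Features) convention (shapes chosen to mirror the question):

import numpy as np

batch_a = np.random.random((200, 50, 100))  # 200 sequences, 50 steps, 100 features
batch_b = np.random.random((200, 10, 100))  # same number of features, different length

# only the middle (Length) dimension differs; the Features dimension must stay 100
print(batch_a.shape, batch_b.shape)  # (200, 50, 100) (200, 10, 100)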

Cells:

Cells in LSTM layers behave exactly like "units" in Dense layers.

A cell is not a step. The number of cells is just the number of "parallel" operations. The group of cells performs the recurrent operations and steps together.

There is communication between the cells, as @Yu-Yang rightly noticed in the comments. But the idea of them being the same entities carried through the steps is still valid.

Those little blocks you see in the usual unrolled-LSTM diagrams are not cells, they are steps.

Variable lengths:

That said, the length of your sequences doesn't affect the number of parameters (matrices) in the LSTM layer at all. It just affects the number of steps.

The fixed set of matrices inside the layer will simply be applied more times for long sequences and fewer times for short sequences. But in all cases, it's the same matrix being updated and passed forward to the next step.

The sequence length changes only the number of updates.
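You can check this yourself: the parameter count of the layer never depends on the sequence length, and the very same weights can process batches of different lengths (a small sketch using Keras' count_params/predict; the numbers mirror the question):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
model.add(LSTM(100, input_shape=(None, 100)))  # None = any sequence length
model.compile(loss='mse', optimizer='rmsprop')

print(model.count_params())  # a fixed number, independent of sequence length

# the very same weights handle 50-step and 10-step sequences
out_long = model.predict(np.random.random((4, 50, 100)))
out_short = model.predict(np.random.random((4, 10, 100)))
print(out_long.shape, out_short.shape)  # (4, 100) (4, 100)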

The layer definition:

The number of cells can be any number at all; it just defines how many parallel "mini brains" will be working together (meaning a more or less powerful network, and more or fewer output features).

LSTM(units=78) 
#will work perfectly well, and will output 78 "features".
#although it will be less intelligent than one with 100 units, outputting 100 features.    

There is one weight matrix and one state/memory matrix that keep being passed forward to the next steps. The state is simply "updated" at each step (the weights are reused), but there isn't a separate matrix for each step.
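For example, in recent Keras versions (note that the question uses Keras 1.0.8, where the same weights are stored as several smaller arrays) the fixed shapes of an LSTM layer's weights look like this:

from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
model.add(LSTM(78, input_shape=(None, 100)))

kernel, recurrent_kernel, bias = model.layers[0].get_weights()
print(kernel.shape)            # (100, 312) -> 100 features x (4 gates * 78 units)
print(recurrent_kernel.shape)  # (78, 312)  -> 78 units     x (4 gates * 78 units)
print(bias.shape)              # (312,)
# none of these shapes depends on how many time steps you feed in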

Image examples (the usual unrolled-RNN diagram, with boxes "A" repeated along the sequence):

Each box "A" is a step where the same group of matrices (states,weights,...) is used and updated.

There aren't 4 cells, but one and the same cell performing 4 updates, one update for each input.

Each X1, X2, ... is one slice of your sequence in the length dimension.


Longer sequences will reuse and update the matrices more times than shorter sequences. But it's still one cell.


The number of cells indeed affects the size of the matrices, but it doesn't depend on the sequence length. All cells work together in parallel, with some communication between them.


Your model

In your model you can create the LSTM layers like this:

model.add(LSTM(anyNumber, return_sequences=True, input_shape=(None, 100)))
model.add(LSTM(anyOtherNumber))

By using None in the input_shape like that, you are already telling your model that it accepts sequences of any length.

All you have to do is train, and your code for training is OK. The only thing that is not allowed is to mix different lengths inside a single batch. So, as you have done, create a batch for each length and train on each batch, for example as in the sketch below.
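A minimal sketch of that training loop, using random data in place of your grouped-by-length batches (train_on_batch is used here to stay version-agnostic; calling fit once per length, as in your code, works just as well):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(100, return_sequences=True, input_shape=(None, 100)))
model.add(LSTM(100))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

# hypothetical batches, each containing only one sequence length
batches = [
    (np.random.random((200, 50, 100)), np.random.random((200, 2))),
    (np.random.random((200, 10, 100)), np.random.random((200, 2))),
    (np.random.random((200, 60, 100)), np.random.random((200, 2))),
]

for x_batch, y_batch in batches:
    model.train_on_batch(x_batch, y_batch)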