LSTM - Making predictions on partial sequence

This question is in continue to a previous question I've asked.

I've trained an LSTM model to predict a binary class (1 or 0) for batches of 100 samples with 3 features each, i.e: the shape of the data is (m, 100, 3), where m is the number of batches.

Data:

[
    [[1,2,3],[1,2,3]... 100 sampels],
    [[1,2,3],[1,2,3]... 100 sampels],
    ... avaialble batches in the training data
]

Target:

[
   [1]
   [0]
   ...
]

Model code:

def build_model(num_samples, num_features, is_training):
    model = Sequential()
    opt = optimizers.Adam(lr=0.0005, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0001)

    batch_size = None if is_training else 1
    stateful = False if is_training else True
    first_lstm = LSTM(32, batch_input_shape=(batch_size, num_samples, num_features), return_sequences=True,
                      activation='tanh', stateful=stateful)

    model.add(first_lstm)
    model.add(LeakyReLU())
    model.add(Dropout(0.2))
    model.add(LSTM(16, return_sequences=True, activation='tanh', stateful=stateful))
    model.add(Dropout(0.2))
    model.add(LeakyReLU())
    model.add(LSTM(8, return_sequences=False, activation='tanh', stateful=stateful))
    model.add(LeakyReLU())
    model.add(Dense(1, activation='sigmoid'))

    if is_training:
        model.compile(loss='binary_crossentropy', optimizer=opt,
                      metrics=['accuracy', keras_metrics.precision(), keras_metrics.recall(), f1])
    return model

For the training stage, the model is NOT stateful. When predicting I'm using a stateful model, iterating over the data and outputting a probability for each sample:

for index, row in data.iterrows():
    if index % 100 == 0:
        predicting_model.reset_states()
    vals = np.array([[row[['a', 'b', 'c']].values]])
    prob = predicting_model.predict_on_batch(vals)

When looking at the probability at the end of a batch, it is exactly the value I get when predicting with the entire batch (not one by one). However, I've expected that the probability will always continue in the right direction when new samples arrive. What actually happens is that the probability output can spike to the wrong class on an arbitrary sample (see below).

Two samples of 100 sample batches over the time of prediction (label = 1):

and Label = 0:

Is there a way to achieve what I want (avoid extreme spikes while predicting probability), or is that a given fact?

Any explanation, advice would be appreciated.

Update Thanks to @today advice, I've tried training the network with the hidden state output for each input time step using return_sequence=True on the last LSTM layer.

So now the labels look like so (shape (100,100)):

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
...]

the model summary:

Layer (type)                 Output Shape              Param #   
=================================================================
lstm_1 (LSTM)                (None, 100, 32)           4608      
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 100, 32)           0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 100, 32)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 100, 16)           3136      
_________________________________________________________________
dropout_2 (Dropout)          (None, 100, 16)           0         
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU)    (None, 100, 16)           0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 100, 8)            800       
_________________________________________________________________
leaky_re_lu_3 (LeakyReLU)    (None, 100, 8)            0         
_________________________________________________________________
dense_1 (Dense)              (None, 100, 1)            9         
=================================================================
Total params: 8,553
Trainable params: 8,553
Non-trainable params: 0
_________________________________________________________________

However, I get an exception:

ValueError: Error when checking target: expected dense_1 to have 3 dimensions, but got array with shape (75, 100)

What do I need to fix?

Note: This is just an idea and it might be wrong. Try it if you would like and I would appreciate any feedback.

Is there a way to achieve what I want (avoid extreme spikes while predicting probability), or is that a given fact?

You can do this experiment: set the return_sequences argument of last LSTM layer to True and replicate the labels of each sample as much as the length of each sample. For example if a sample has a length of 100 and its label is 0, then create a new label for this sample which consists of 100 zeros (you can probably easily do this using numpy function like np.repeat). Then retrain your new model and test it on new samples afterwards. I am not sure of this, but I would expect more monotonically increasing/decreasing probability graphs this time.

Update: The error you mentioned is caused by the fact that the labels should be a 3D array (look at the output shape of last layer in the model summary). Use np.expand_dims to add another axis of size one to the end. The correct way of repeating the labels would look like this, assuming y_train has a shape of (num_samples,):

rep_y_train = np.repeat(y_train, num_reps).reshape(-1, num_reps, 1)

The experiment on IMDB dataset:

Actually, I tried the experiment suggested above on the IMDB dataset using a simple model with one LSTM layer. One time, I used only one label per each sample (as in original approach of @Shlomi) and the other time I replicated the labels to have one label per each timestep of a sample (as I suggested above). Here is the code if you would like to try it yourself:

from keras.layers import *
from keras.models import Sequential, Model
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
import numpy as np

vocab_size = 10000
max_len = 200
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
X_train = pad_sequences(x_train, maxlen=max_len)

def create_model(return_seq=False, stateful=False):
    batch_size = 1 if stateful else None
    model = Sequential()
    model.add(Embedding(vocab_size, 128, batch_input_shape=(batch_size, None)))
    model.add(CuDNNLSTM(64, return_sequences=return_seq, stateful=stateful))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
    return model

# train model with one label per sample
train_model = create_model()
train_model.fit(X_train, y_train, epochs=10, batch_size=128, validation_split=0.3)

# replicate the labels
y_train_rep = np.repeat(y_train, max_len).reshape(-1, max_len, 1)

# train model with one label per timestep
rep_train_model = create_model(True)
rep_train_model.fit(X_train, y_train_rep, epochs=10, batch_size=128, validation_split=0.3)

Then we can create the stateful replicas of the training models and run them on some test data to compare their results:

# replica of `train_model` with the same weights
test_model = create_model(False, True)
test_model.set_weights(train_model.get_weights())
test_model.reset_states()

# replica of `rep_train_model` with the same weights
rep_test_model = create_model(True, True)
rep_test_model.set_weights(rep_train_model.get_weights())
rep_test_model.reset_states()

def stateful_predict(model, samples):
    preds = []
    for s in samples:
        model.reset_states()
        ps = []
        for ts in s:
            p = model.predict(np.array([[ts]]))
            ps.append(p[0,0])
        preds.append(list(ps))
    return preds

X_test = pad_sequences(x_test, maxlen=max_len)

Actually, the first sample of X_test has a 0 label (i.e. belongs to negative class) and the second sample of X_test has a 1 label (i.e. belongs to positive class). So let's first see what the stateful prediction of test_model (i.e. the one that were trained using one label per sample) for these two samples would look like:

import matplotlib.pyplot as plt

preds = stateful_predict(test_model, X_test[0:2])

plt.plot(preds[0])
plt.plot(preds[1])
plt.legend(['Class 0', 'Class 1'])

The result:

Correct label (i.e. probability) at the end (i.e. timestep 200) but very spiky and fluctuating in between. Now let's compare it with the stateful predictions of the rep_test_model (i.e. the one that were trained using one label per each timestep):

preds = stateful_predict(rep_test_model, X_test[0:2])

plt.plot(preds[0])
plt.plot(preds[1])
plt.legend(['Class 0', 'Class 1'])

The result:

Again, correct label prediction at the end but this time with a much more smoother and monotonic trend, as expected.

Note that this was just an example for demonstration and therefore I have used a very simple model here with just one LSTM layer and I did not attempt to tune it at all. I guess with a better tuning of the model (e.g. adjusting the number of layers, number of units in each layer, activation functions used, optimizer type and parameters, etc.), you might get far better results.