This question is a follow-up to a previous question I've asked.
I've trained an LSTM model to predict a binary class (1 or 0) for batches of 100 samples with 3 features each, i.e. the shape of the data is (m, 100, 3), where m is the number of batches.
Data:
[
[[1,2,3],[1,2,3]... 100 samples],
[[1,2,3],[1,2,3]... 100 samples],
... available batches in the training data
]
Target:
[
[1]
[0]
...
]
Model code:
from keras.models import Sequential
from keras.layers import LSTM, LeakyReLU, Dropout, Dense
from keras import optimizers
import keras_metrics  # third-party package providing precision/recall metrics

def build_model(num_samples, num_features, is_training):
    model = Sequential()
    opt = optimizers.Adam(lr=0.0005, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0001)
    # mini-batches while training; one sample at a time while predicting
    batch_size = None if is_training else 1
    stateful = not is_training
    first_lstm = LSTM(32, batch_input_shape=(batch_size, num_samples, num_features),
                      return_sequences=True, activation='tanh', stateful=stateful)
    model.add(first_lstm)
    model.add(LeakyReLU())
    model.add(Dropout(0.2))
    model.add(LSTM(16, return_sequences=True, activation='tanh', stateful=stateful))
    model.add(Dropout(0.2))
    model.add(LeakyReLU())
    model.add(LSTM(8, return_sequences=False, activation='tanh', stateful=stateful))
    model.add(LeakyReLU())
    model.add(Dense(1, activation='sigmoid'))
    if is_training:
        # f1 is a custom metric defined elsewhere
        model.compile(loss='binary_crossentropy', optimizer=opt,
                      metrics=['accuracy', keras_metrics.precision(), keras_metrics.recall(), f1])
    return model
For the training stage, the model is NOT stateful. When predicting, I'm using a stateful model, iterating over the data and outputting a probability for each sample:
for index, row in data.iterrows():
    if index % 100 == 0:
        # a new 100-sample batch starts here, so clear the LSTM states
        predicting_model.reset_states()
    vals = np.array([[row[['a', 'b', 'c']].values]])  # shape (1, 1, 3): one timestep
    prob = predicting_model.predict_on_batch(vals)
When looking at the probability at the end of a batch, it is exactly the value I get when predicting with the entire batch at once (rather than one sample at a time). However, I expected the probability to keep moving in the right direction as new samples arrive. What actually happens is that the probability output can spike towards the wrong class on an arbitrary sample (see below).
Two samples of 100-sample batches over the time of prediction (label = 1):

[plots: predicted probability per timestep, spiking towards the wrong class partway through the batch]
Is there a way to achieve what I want (avoid extreme spikes while predicting the probability), or is that a given?
Any explanation or advice would be appreciated.
Update
Thanks to @today's advice, I've tried training the network with the hidden state output for each input timestep, using return_sequences=True on the last LSTM layer.
So now the labels look like this (shape (100, 100)):
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
...]
The model summary:
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 100, 32) 4608
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU) (None, 100, 32) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 100, 32) 0
_________________________________________________________________
lstm_2 (LSTM) (None, 100, 16) 3136
_________________________________________________________________
dropout_2 (Dropout) (None, 100, 16) 0
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU) (None, 100, 16) 0
_________________________________________________________________
lstm_3 (LSTM) (None, 100, 8) 800
_________________________________________________________________
leaky_re_lu_3 (LeakyReLU) (None, 100, 8) 0
_________________________________________________________________
dense_1 (Dense) (None, 100, 1) 9
=================================================================
Total params: 8,553
Trainable params: 8,553
Non-trainable params: 0
_________________________________________________________________
However, I get an exception:
ValueError: Error when checking target: expected dense_1 to have 3 dimensions, but got array with shape (75, 100)
What do I need to fix?
Answer (by @today):

Note: This is just an idea and it might be wrong. Try it if you would like and I would appreciate any feedback.
You can do this experiment: set the return_sequences argument of the last LSTM layer to True and replicate the label of each sample as many times as the length of that sample. For example, if a sample has a length of 100 and its label is 0, then create a new label for this sample which consists of 100 zeros (you can probably do this easily using a NumPy function like np.repeat). Then retrain your new model and test it on new samples afterwards. I am not sure about this, but I would expect more monotonically increasing/decreasing probability graphs this time.

Update: The error you mentioned is caused by the fact that the labels should be a 3D array (look at the output shape of the last layer in the model summary). Use np.expand_dims to add another axis of size one at the end. The correct way of repeating the labels, assuming y_train has a shape of (num_samples,), would look like this:
The experiment on the IMDB dataset:

Actually, I tried the experiment suggested above on the IMDB dataset using a simple model with one LSTM layer. One time I used only one label per sample (as in @Shlomi's original approach) and the other time I replicated the labels to have one label per timestep of a sample (as I suggested above). Here is the code if you would like to try it yourself:
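(The original snippet was not preserved in this copy, so the following is a reconstruction sketch: the vocabulary size, embedding/LSTM sizes, and training settings are assumptions; the sequence length of 200 matches the "timestep 200" mentioned in the plots below.)

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
import numpy as np

vocab_size = 10000
max_len = 200  # pad/truncate every review to 200 timesteps

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size)
X_train = pad_sequences(X_train, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)

def create_model(return_sequences):
    model = Sequential()
    model.add(Embedding(vocab_size, 32, input_length=max_len))
    model.add(LSTM(32, return_sequences=return_sequences))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# one label per sample (original approach)
train_model = create_model(return_sequences=False)
train_model.fit(X_train, y_train, epochs=3, batch_size=128)

# one label per timestep (suggested approach)
rep_y_train = np.expand_dims(np.repeat(y_train, max_len).reshape(-1, max_len), axis=-1)
rep_train_model = create_model(return_sequences=True)
rep_train_model.fit(X_train, rep_y_train, epochs=3, batch_size=128)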
Then we can create stateful replicas of the trained models and run them on some test data to compare their results:
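(Again a sketch continuing the reconstruction above; create_stateful_model and stateful_predict are hypothetical helper names. The stateful replicas take one timestep at a time, mirroring the prediction loop in the question, and copy the weights from the trained models.)

def create_stateful_model(return_sequences):
    model = Sequential()
    # batch_input_shape=(1, 1): one sample, one timestep at a time
    model.add(Embedding(vocab_size, 32, batch_input_shape=(1, 1)))
    model.add(LSTM(32, return_sequences=return_sequences, stateful=True))
    model.add(Dense(1, activation='sigmoid'))
    return model

test_model = create_stateful_model(return_sequences=False)
test_model.set_weights(train_model.get_weights())

rep_test_model = create_stateful_model(return_sequences=True)
rep_test_model.set_weights(rep_train_model.get_weights())

def stateful_predict(model, sample):
    # feed the padded sequence one timestep at a time and record each probability
    model.reset_states()
    probs = []
    for token in sample:
        p = model.predict_on_batch(np.array([[token]]))
        probs.append(float(p.ravel()[-1]))
    return probs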
Actually, the first sample of X_test has a 0 label (i.e. it belongs to the negative class) and the second sample of X_test has a 1 label (i.e. it belongs to the positive class). So let's first see what the stateful predictions of test_model (i.e. the one that was trained using one label per sample) look like for these two samples:
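(For instance, using the hypothetical stateful_predict helper from the sketch above:)

import matplotlib.pyplot as plt

plt.plot(stateful_predict(test_model, X_test[0]), label='sample 1 (label = 0)')
plt.plot(stateful_predict(test_model, X_test[1]), label='sample 2 (label = 1)')
plt.xlabel('timestep')
plt.ylabel('predicted probability')
plt.legend()
plt.show()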
The result:

[plot: predicted probability per timestep for the two samples]

The probability reaches the correct label at the end (i.e. at timestep 200), but it is very spiky and fluctuates in between. Now let's compare this with the stateful predictions of rep_test_model (i.e. the one that was trained using one label per timestep):
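(Same plotting code as before, just with the other model:)

plt.plot(stateful_predict(rep_test_model, X_test[0]), label='sample 1 (label = 0)')
plt.plot(stateful_predict(rep_test_model, X_test[1]), label='sample 2 (label = 1)')
plt.xlabel('timestep')
plt.ylabel('predicted probability')
plt.legend()
plt.show()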
The result:

[plot: predicted probability per timestep for the two samples]

Again, the prediction reaches the correct label at the end, but this time with a much smoother and more monotonic trend, as expected.
Note that this was just an example for demonstration purposes, so I have used a very simple model here with just one LSTM layer and I did not attempt to tune it at all. I guess that with better tuning of the model (e.g. adjusting the number of layers, the number of units in each layer, the activation functions, the optimizer type and parameters, etc.), you might get far better results.