Let's suppose I have a sequence of integers:
0, 1, 2, ...

and want to predict the next integer given the last 3 integers, e.g.:

[0, 1, 2] -> 3, [3, 4, 5] -> 6, etc.
Suppose I set up my model like so:

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

batch_size = 1
time_steps = 3

model = Sequential()
model.add(LSTM(4, batch_input_shape=(batch_size, time_steps, 1), stateful=True))
model.add(Dense(1))
```
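To sanity-check the wiring, you might compile and inspect the shapes (the optimizer and loss here are arbitrary stand-ins):

```python
model.compile(optimizer='adam', loss='mse')
model.summary()
# LSTM output shape:  (1, 4) - final hidden state only (return_sequences=False)
# Dense output shape: (1, 1) - the predicted next integer
```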
It is my understanding that the model has the following structure (please excuse the crude drawing):
First Question: is my understanding correct?
Note I have drawn the previous states `C_{t-1}`, `h_{t-1}` entering the picture, as these are exposed when specifying `stateful=True`. In this simple "next integer prediction" problem, performance should improve by providing this extra information (as long as the previous state results from the previous 3 integers).
This brings me to my main question: it seems the standard practice (for example, see this blog post and the `TimeseriesGenerator` Keras preprocessing utility) is to feed a staggered set of inputs to the model during training.
For example:

```
batch0: [[0, 1, 2]]
batch1: [[1, 2, 3]]
batch2: [[2, 3, 4]]
# etc
```
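For instance, a minimal sketch with `TimeseriesGenerator` (from `keras.preprocessing.sequence`; the reshape to a column vector is needed since it expects 2D data):

```python
import numpy as np
from keras.preprocessing.sequence import TimeseriesGenerator

seq = np.arange(10).reshape(-1, 1)              # 0..9 as a (10, 1) column
gen = TimeseriesGenerator(seq, seq, length=3, batch_size=1)

x, y = gen[0]   # x: [[[0],[1],[2]]], y: [[3]]
x, y = gen[1]   # x: [[[1],[2],[3]]], y: [[4]]  <- windows staggered by 1
```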
This has me confused because it seems this requires the output of the 1st LSTM cell (corresponding to the 1st time step). See this figure:
From the TensorFlow docs:

> stateful: Boolean (default False). If True, the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch.
It seems this "internal" state isn't available, and all that is available is the final state. See this figure:
So, if my understanding is correct (which it's clearly not), shouldn't we be feeding non-overlapped windows of samples to the model when using `stateful=True`? E.g.:
```
batch0: [[0, 1, 2]]
batch1: [[3, 4, 5]]
batch2: [[6, 7, 8]]
# etc
```
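For comparison, the same generator produces this non-overlapping variant via its `stride` parameter (a sketch, same assumptions as above):

```python
import numpy as np
from keras.preprocessing.sequence import TimeseriesGenerator

seq = np.arange(10).reshape(-1, 1)
gen = TimeseriesGenerator(seq, seq, length=3, stride=3, batch_size=1)

x, y = gen[0]   # [[[0],[1],[2]]] -> [[3]]
x, y = gen[1]   # [[[3],[4],[5]]] -> [[6]]  <- windows no longer overlap
```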
The answer is: it depends on the problem at hand. For your case of one-step prediction - yes, you can, but you don't have to. Whether you do or not will significantly impact learning.
**Batch vs. sample mechanism** ("see AI" = see the "additional info" section)
All models treat samples as independent examples; a batch of 32 samples is like feeding 1 sample at a time, 32 times (with differences - see AI). From the model's perspective, data is split into the batch dimension, `batch_shape[0]`, and the feature dimensions, `batch_shape[1:]` - the two "don't talk." The only relation between the two is via the gradient (see AI).
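To see that independence concretely, here's a hypothetical check: for a non-stateful, deterministic model, each sample's output doesn't depend on what else is in its batch.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential([LSTM(4, input_shape=(3, 1)), Dense(1)])
x = np.random.randn(32, 3, 1)   # batch_shape[0] = 32; batch_shape[1:] = (3, 1)

full = model.predict(x)                                          # all 32 at once
single = np.vstack([model.predict(x[i:i + 1]) for i in range(32)])  # one at a time
assert np.allclose(full, single, atol=1e-6)                      # samples "don't talk"
```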
**Overlap vs no-overlap batch**

Perhaps the best approach to understanding it is information-based. I'll begin with timeseries binary classification, then tie it to prediction: suppose you have 10-minute EEG recordings, 240000 timesteps each. Task: seizure or non-seizure?
Take 10 samples, each of shape `(240000, 1)`. How to feed?

1. `(10, 54000, 1)`, all samples included, slicing as `sample[0:54000]; sample[54000:108000]` ...
2. `(10, 54000, 1)`, all samples included, slicing as `sample[0:54000]; sample[1:54001]` ...

Which of the two above do you take? If (2), your neural net will never confuse a seizure for a non-seizure for those 10 samples. But it'll also be clueless about any other sample. I.e., it will massively overfit, because the information it sees per iteration barely differs (1/54000 = 0.0019%) - so you're basically feeding it the same batch several times in a row. Now suppose (3):
3. `(10, 54000, 1)`, all samples included, slicing as `sample[0:54000]; sample[27000:81000]` ...

A lot more reasonable; now our windows have a 50% overlap rather than 99.998%.
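A sketch of the three slicing options (`make_windows` is a hypothetical helper; the recording length is scaled down 1000x so the snippet runs in modest memory):

```python
import numpy as np

def make_windows(sample, win_len, stride):
    """Slice a (timesteps, 1) sample into (possibly overlapping) windows."""
    starts = range(0, len(sample) - win_len + 1, stride)
    return np.stack([sample[s:s + win_len] for s in starts])

eeg = np.random.randn(240, 1)       # stand-in for one (240000, 1) recording
opt1 = make_windows(eeg, 54, 54)    # option (1): no overlap
opt2 = make_windows(eeg, 54, 1)     # option (2): ~98% overlap here (99.998% at full scale)
opt3 = make_windows(eeg, 54, 27)    # option (3): 50% overlap
```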
**Prediction: overlap bad?**
If you are doing one-step prediction, the information landscape changes: each shifted window now carries a different label, so overlapping windows no longer amount to feeding nearly the same batch several times in a row, as they did in the classification example.

This dramatically changes your loss function, and what is 'good practice' for minimizing it: the labels (and hence the loss) differ timestep-to-timestep, and a predictor should be robust to where its input window starts - which is exactly what sliding the window by one step trains.
**What should I do?**

First, make sure you understand this entire post, as nothing here is really "optional." Then, here's the key about overlap vs no-overlap, per batch:

1. Overlap: the model sees each transition in several contexts, using the available information more fully
2. No overlap: the model sees more distinct subsequences per iteration

Your goal: balance the two; (1)'s main edge over (2) is that it lets the model extract better-quality information by examining the sequence from several "angles."

**Should I ever use (2) in prediction?**

It can make sense if your sequences are very long, where heavy overlap would recreate the "same batch several times" effect from the classification example; otherwise (1) is generally the safer default.
**LSTM stateful: may actually be entirely useless for your problem**

Stateful is used when the LSTM can't process the entire sequence at once, so it's "split up" - or when different gradients are desired from backpropagation. With the former, the idea is: the LSTM considers the former sequence in its assessment of the latter:
- `t0 = seq[0:50]; t1 = seq[50:100]` makes sense; `t0` logically leads to `t1`
- `seq[0:50] --> seq[1:51]` makes no sense; `t1` doesn't causally derive from `t0`
In other words: do not overlap across separate batches when stateful. The same batch is OK since, again, samples are independent - there is no "state" between them.
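A quick sketch of the state carry-over (reusing a toy stateful model; `reset_states` is the Keras call for clearing states between unrelated sequences):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(4, batch_input_shape=(1, 50, 1), stateful=True))
model.add(Dense(1))

seq = np.random.randn(1, 100, 1)
t0, t1 = seq[:, :50], seq[:, 50:]   # seq[0:50] --> seq[50:100]

model.predict(t0)     # final state after t0 is kept...
model.predict(t1)     # ...and used as t1's initial state
model.reset_states()  # clear before starting an unrelated sequence
```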
**When to use stateful**: when the LSTM benefits from considering the previous batch in its assessment of the next. This can include one-step predictions, but only if you can't feed the entire seq at once:

- Set up `t0, t1` as in the first bullet above.
- If the earlier chunk should only "warm up" the state without driving weight updates, feed it with learning disabled, e.g. `lr = 0`.

**When and how does LSTM "pass states" in stateful?**
`stateful=True` requires you to specify `batch_shape` instead of `input_shape` - because Keras builds `batch_size` separate states of the LSTM at compiling.

Per above, you cannot do this:
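A sketch of the forbidden ordering (the `sampleNM` names are illustrative placeholders: sample `N`, timestep window `M`):

```python
batch1 = [sample10, sample20, sample30, sample40]  # window 0 of each sample
batch2 = [sample21, sample41, sample11, sample31]  # window 1, shuffled across slots
```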
This implies `21` causally follows `10` - and will wreck training. Instead do:
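i.e., keep each sample's windows in the same batch slot (same placeholder notation):

```python
batch1 = [sample10, sample20, sample30, sample40]  # window 0 of each sample
batch2 = [sample11, sample21, sample31, sample41]  # slot i continues sample i
```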
A "batch" is a set of samples - 1 or greater (assume always latter for this answer) . Three approaches to iterate over data: Batch Gradient Descent (entire dataset at once), Stochastic GD (one sample at a time), and Minibatch GD (in-between). (In practice, however, we call the last SGD also and only distinguish vs BGD - assume it so for this answer.) Differences:
**BONUS DIAGRAMS**