Implement Causal CNN in Keras for multivariate time-series prediction

Published: 2020-07-18 05:05

Question:

This question is a follow-up to my previous question here: Multi-feature causal CNN - Keras implementation; however, there are numerous things that are still unclear to me, so I think it warrants a new question. The model in question here was built according to the accepted answer in the post mentioned above.

I am trying to apply a causal CNN model to multivariate time-series data: sequences of 10 time steps (lookback) with 5 features.

lookback, features = 10, 5
  • What should filters and kernel be set to?

    • What is the effect of filters and kernel on the network?
    • Are these just arbitrary numbers, i.e. like the number of neurons in an ANN layer?
    • Or will they have an effect on how the net interprets the time-steps?
  • What should dilations be set to?

    • Is this just an arbitrary number, or does it represent the lookback of the model?
filters = 32
kernel = 5
dilations = 5
dilation_rates = [2 ** i for i in range(dilations)]

model = Sequential()
model.add(InputLayer(input_shape=(lookback, features)))
model.add(Reshape(target_shape=(features, lookback, 1), input_shape=(lookback, features)))

According to the previously mentioned answer, the input needs to be reshaped according to the following logic:

  • After the Reshape, the 5 input features are treated as the temporal dimension for the TimeDistributed layer
  • When Conv1D is applied to each input feature, it sees the layer's shape as (10, 1)

  • with the default "channels_last", therefore:

  • the 10 time-steps are the temporal dimension
  • 1 is the "channel", the new location for the feature maps
# Add causal layers
for dilation_rate in dilation_rates:
    model.add(TimeDistributed(Conv1D(filters=filters,
                              kernel_size=kernel,
                              padding='causal',
                              dilation_rate=dilation_rate,
                              activation='elu')))

According to the mentioned answer, the model needs to be reshaped, according to the following logic:

  • Stack feature maps on top of each other so each time step can look at all features produced earlier - (10 time steps, 5 features * 32 filters)

Next, causal layers are now applied to the 5 input features dependently.

  • Why were they initially applied independently?
  • Why are they now applied dependently?
model.add(Reshape(target_shape=(lookback, features * filters)))

next_dilations = 3
dilation_rates = [2 ** i for i in range(next_dilations)]
for dilation_rate in dilation_rates:
    model.add(Conv1D(filters=filters,
                     kernel_size=kernel,
                     padding='causal',
                     dilation_rate=dilation_rate,
                     activation='elu'))
    model.add(MaxPool1D())

model.add(Flatten())
model.add(Dense(units=1, activation='linear'))

model.summary()

SUMMARY

  • What should filters and kernel be set to?
    • Will they have an effect on how the net interprets the time-steps?
  • What should dilations be set to in order to represent a lookback of 10?

  • Why are causal layers initially applied independently?

  • Why are they applied dependently after reshape?
    • Why not apply them dependently from the beginning?

===========================================================================

FULL CODE

# Imports (tf.keras assumed; adjust if using standalone keras)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (InputLayer, Reshape, TimeDistributed,
                                     Conv1D, MaxPool1D, Flatten, Dense)

lookback, features = 10, 5

filters = 32
kernel = 5
dilations = 5
dilation_rates = [2 ** i for i in range(dilations)]

model = Sequential()
model.add(InputLayer(input_shape=(lookback, features)))
model.add(Reshape(target_shape=(features, lookback, 1), input_shape=(lookback, features)))

# Add causal layers
for dilation_rate in dilation_rates:
    model.add(TimeDistributed(Conv1D(filters=filters,
                              kernel_size=kernel,
                              padding='causal',
                              dilation_rate=dilation_rate,
                              activation='elu')))


model.add(Reshape(target_shape=(lookback, features * filters)))

next_dilations = 3
dilation_rates = [2 ** i for i in range(next_dilations)]
for dilation_rate in dilation_rates:
    model.add(Conv1D(filters=filters,
                     kernel_size=kernel,
                     padding='causal',
                     dilation_rate=dilation_rate,
                     activation='elu'))
    model.add(MaxPool1D())

model.add(Flatten())
model.add(Dense(units=1, activation='linear'))

model.summary()

===========================================================================

EDIT:

Daniel, thank you for your answer.

Question:

If you can explain "exactly" how you're structuring your data (what the original data is, how you're transforming it into the input shape, whether you have independent sequences, whether you're creating sliding windows, etc.), a better understanding of this process could be achieved.

Answer:

I hope I understand your question correctly.

Each feature is a sequence array of time-series data. They are independent in the sense that they are not an image; however, they correlate with each other somewhat.

That is why I am trying to use WaveNet, which is very good at predicting a single time-series array; however, my problem requires me to use multiple features.

Answer 1:

Comments about the given answer

Questions:

  • Why are causal layers initially applied independently?
  • Why are they applied dependently after reshape?
    • Why not apply them dependently from the beginning?

That answer is sort of strange. I'm not an expert, but I don't see the need to keep the features independent with a TimeDistributed layer. I also cannot say whether it gives a better result or not. At first glance I'd say it's just unnecessary. It might bring extra intelligence, though, since it might see relations that involve distant steps between two features instead of just looking at the "same steps". (This should be tested.)

Nevertheless, there is a mistake in that approach.

The reshapes that are intended to swap the lookback and feature sizes are not doing what they are expected to do. The author of the answer clearly wants to swap axes (which keeps the interpretation of what is a feature and what is the lookback), which is different from a reshape (which mixes everything together, so the data loses its meaning).

A correct approach would need actual axis swapping, such as model.add(Permute((2, 1))), instead of the reshapes.
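To illustrate the difference with plain NumPy (a made-up array, not the question's data): reshape just reinterprets the flat buffer, while a transpose (which is what Permute does on the batched tensor) actually swaps the axes.

```python
import numpy as np

# One sample: 10 time steps (lookback) x 5 features
x = np.arange(50).reshape(10, 5)

# Reshape to (5, 10): rows do NOT correspond to features --
# each row is just the next 10 values of the flattened buffer
reshaped = x.reshape(5, 10)

# Transpose (what Permute((2, 1)) would do per sample): rows are features
transposed = x.T

print(x[:, 0])        # feature 0 over time: [0, 5, 10, ..., 45]
print(transposed[0])  # same values: the axes were truly swapped
print(reshaped[0])    # [0, 1, ..., 9]: time steps and features mixed up
```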

So, I don't know the answers to these questions, but nothing seems to create that need. One thing is sure: you will certainly want the dependent part. A model will not get anywhere near the intelligence of your original model if it doesn't consider relations between features. (Unless you're lucky enough to have data that is completely independent.)

Now, explaining the relation between LSTM and Conv1D

An LSTM can be directly compared to a Conv1D, and the shapes used are exactly the same; they mean virtually the same thing, as long as you're using channels_last.

That said, the shape (samples, input_length, features_or_channels) is the correct shape for both LSTM and Conv1D. In fact, features and channels are exactly the same thing in this case. What changes is how each layer works regarding the input length and calculations.

Concept of filters and kernels

The kernel is the entire tensor inside the conv layer that is multiplied by the inputs to produce the results. A kernel has a spatial size (kernel_size) and a number of filters (output features); its number of input channels is taken from the input automatically.

There is no "number of kernels", but there is a kernel_size. The kernel size is how many steps along the length are joined together to produce each output step. (This tutorial is great for understanding what a 2D convolution does and what the kernel size is; just imagine 1D images instead. It doesn't show the number of "filters", though; the animations are single-filter.)

The number of filters relates directly to the number of output features: they're exactly the same thing.

What should filters and kernel be set to?

So, if your LSTM layer is using units=256, meaning it will output 256 features, you should use filters=256, meaning your convolution will output 256 channels/features.

This is not a rule, though; you may find that using more or fewer filters brings better results, since the layers do different things after all. There is also no need for all layers to have the same number of filters! Here you should do parameter tuning: test which numbers work best for your goal and data.

Now, kernel size is something that can't be compared to the LSTM. It's a new thing added to the model.

The number 3 is a very common choice. It means the convolution will take three time steps to produce one time step, then slide one step to take another group of three steps to produce the next step, and so on.
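A rough NumPy sketch of this sliding (a plain "valid" convolution with no padding; the data and weights are random placeholders): with kernel_size=3, a length-10 input yields 10 - 3 + 1 = 8 output steps, and with F filters each output step has F features.

```python
import numpy as np

length, in_channels, filters, kernel_size = 10, 5, 32, 3
x = np.random.randn(length, in_channels)
# Keras Conv1D stores its kernel with shape (kernel_size, in_channels, filters)
w = np.random.randn(kernel_size, in_channels, filters)

# Slide a kernel_size-step window along the length, one output step per position
out = np.stack([
    np.tensordot(x[t:t + kernel_size], w, axes=([0, 1], [0, 1]))
    for t in range(length - kernel_size + 1)
])
print(out.shape)  # (8, 32): one step per window position, one feature per filter
```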

Dilations

The dilation rate is the spacing between the input steps the convolution filter reads.

  • A convolution with dilation_rate=1 takes kernel_size consecutive steps to produce one step.
  • A convolution with dilation_rate=2 takes, for instance, steps 0, 2 and 4 to produce one step, then takes steps 1, 3 and 5 to produce the next step, and so on.
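This tap pattern can be sketched in plain Python (the taps helper is just for illustration, not a Keras API):

```python
def taps(start, kernel_size, dilation_rate):
    """Input indices a dilated convolution reads to produce one output step."""
    return [start + i * dilation_rate for i in range(kernel_size)]

print(taps(0, 3, 1))  # [0, 1, 2]  consecutive steps
print(taps(0, 3, 2))  # [0, 2, 4]  one step skipped between taps
print(taps(1, 3, 2))  # [1, 3, 5]  next output step, shifted by one
```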

What should dilations be set to in order to represent a lookback of 10?

range = 1 + (kernel_size - 1) * dilation_rate

So, with kernel_size = 3:

  • dilation exponent 0 (dilation_rate=1): the kernel will span 3 steps
  • dilation exponent 1 (dilation_rate=2): the kernel will span 5 steps
  • dilation exponent 2 (dilation_rate=4): the kernel will span 9 steps
  • dilation exponent 3 (dilation_rate=8): the kernel will span 17 steps
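As a sanity check of the formula in plain Python (span is a hypothetical helper name): each layer of a stack of dilated convolutions widens the receptive field by (kernel_size - 1) * dilation_rate, so with kernel_size=3 the rates 1, 2, 4 from the question's second block already cover a lookback of 10 (ignoring the pooling layers).

```python
kernel_size = 3

def span(dilation_rate, kernel_size=3):
    # Steps covered by a single dilated kernel
    return 1 + (kernel_size - 1) * dilation_rate

for d in [1, 2, 4, 8]:
    print(d, span(d))  # 1 -> 3, 2 -> 5, 4 -> 9, 8 -> 17

# Receptive field of stacked causal dilated layers (no pooling):
rates = [1, 2, 4]
receptive_field = 1 + sum((kernel_size - 1) * d for d in rates)
print(receptive_field)  # 15, which already covers a lookback of 10
```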

My question to you

If you can explain "exactly" how you're structuring your data (what the original data is, how you're transforming it into the input shape, whether you have independent sequences, whether you're creating sliding windows, etc.), a better understanding of this process could be achieved.