I am preprocessing a timeseries dataset, changing its shape from 2 dimensions (datapoints, features) into 3 dimensions (datapoints, time_window, features).
In this context, the time window (sometimes also called look-back) indicates the number of previous time steps/datapoints that are used as input variables to predict the next time period. In other words, the time window is how much past data the machine learning algorithm takes into consideration for a single prediction in the future.
The issue with this approach (or at least with my implementation) is that it is quite inefficient in terms of memory usage, since it introduces data redundancy across the windows, causing the input data to become very heavy.
This is the function that I have been using so far to reshape the input data into a 3-dimensional structure.
import numpy as np
from sys import getsizeof

def time_framer(data_to_frame, window_size=1):
    """It transforms a 2d dataset into 3d based on a specific size;
    original function can be found at:
    https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
    """
    n_datapoints = data_to_frame.shape[0] - window_size
    framed_data = np.empty(
        shape=(n_datapoints, window_size, data_to_frame.shape[1])).astype(np.float32)
    for index in range(n_datapoints):
        framed_data[index] = data_to_frame[index:(index + window_size)]
    print(framed_data.shape)
    # print the size of the output in MB
    print(framed_data.nbytes / 10 ** 6)
    print(getsizeof(framed_data) / 10 ** 6)
    # quick and dirty quality test to check if the data has been correctly framed
    test1 = list(set(framed_data[0][1] == framed_data[1][0]))
    if test1[0] and len(test1) == 1:
        print('Data is correctly framed')
    return framed_data
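To quantify the redundancy: each input row ends up repeated in up to window_size different windows, so the framed copy is roughly window_size times the size of the original. A small illustration (the shapes here are arbitrary):

```python
import numpy as np

# Illustrative only: a small input to show how the framed copy grows.
data = np.arange(1000 * 4, dtype=np.float32).reshape(1000, 4)
window_size = 50
n = data.shape[0] - window_size

framed = np.empty((n, window_size, data.shape[1]), dtype=np.float32)
for i in range(n):
    framed[i] = data[i:i + window_size]

print(data.nbytes / 10 ** 6)    # 0.016 MB
print(framed.nbytes / 10 ** 6)  # 0.76 MB: roughly window_size times larger
```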
I have been advised to use numpy's strides trick to overcome this problem and reduce the size of the reshaped data. Unfortunately, every resource I have found so far on this subject focuses on implementing the trick on a 2-dimensional array, just like this excellent tutorial. I have been struggling with my use case, which involves a 3-dimensional output. Here is the best I have come up with; however, it neither succeeds in reducing the size of framed_data, nor does it frame the data correctly, as it does not pass the quality test.
I am quite sure that my error is in the strides parameter, which I have not fully understood. The new_strides are the only values I managed to successfully feed to as_strided.
from numpy.lib.stride_tricks import as_strided

def strides_trick_time_framer(data_to_frame, window_size=1):
    new_strides = (data_to_frame.strides[0],
                   data_to_frame.strides[0] * data_to_frame.shape[1],
                   data_to_frame.strides[0] * window_size)
    n_datapoints = data_to_frame.shape[0] - window_size
    print('striding.....')
    framed_data = as_strided(data_to_frame,
                             shape=(n_datapoints,  # .flatten() here did not change the outcome
                                    window_size,
                                    data_to_frame.shape[1]),
                             strides=new_strides).astype(np.float32)
    # print the size of the output in MB
    print(framed_data.nbytes / 10 ** 6)
    print(getsizeof(framed_data) / 10 ** 6)
    # quick and dirty test to check if the data has been correctly framed
    test1 = list(set(framed_data[0][1] == framed_data[1][0]))
    if test1[0] and len(test1) == 1:
        print('Data is correctly framed')
    return framed_data
Any help would be highly appreciated!
You can use the stride template function window_nd I made here. Then to stride over just the first dimension, you just need to apply it along axis 0.
Haven't found a built-in window function yet that can work over arbitrary axes, so unless there's been a new one implemented in scipy.signal or skimage recently, that's probably your best bet.
EDIT: To see the memory savings, you will need to use the method described by @ali_m here, as the basic ndarray.nbytes is naive to shared memory.

For this X, this as_strided produces the same array as your time_framer.
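The code blocks from this answer did not survive formatting; reconstructed from the explanation that follows, the call was presumably along these lines (the sample X and window size here are illustrative assumptions):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

X = np.arange(24, dtype=np.float32).reshape(8, 3)
window_size = 2
n_datapoints = X.shape[0] - window_size

# Same shape as time_framer's output, but the strides reuse X's memory:
# each window starts one row (X.strides[0] bytes) after the previous one,
# and within a window the row and column strides are just X's own strides.
framed = as_strided(X,
                    shape=(n_datapoints, window_size, X.shape[1]),
                    strides=(X.strides[0], X.strides[0], X.strides[1]))

# passes the question's quality test: overlapping rows are identical
print(np.array_equal(framed[0][1], framed[1][0]))  # True
```

Note that no .astype() is applied here: casting would force a copy and discard the memory savings.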
It strides the last dimension just like X, and the 2nd-to-last as well. The first dimension advances one row, so it too gets X.strides[0]. So the window size only affects the shape, not the strides. In your as_strided version, just use strides=(X.strides[0], X.strides[0], X.strides[1]).
Minor correction: set the default window size to 2 or larger, since 1 produces an indexing error in the test.
Looking at getsizeof: wait, why is X's size smaller than its nbytes? Because it is a view (see line [734] above). As noted in another SO answer, getsizeof has to be used with caution: Why the size of numpy array is different?
Now for the expanded copy, and the strided version: x1's size is just like a view's (128 bytes, because it's 3d). But if we try to change its dtype, it makes a copy, and the strides and size are then the same as x2's.
Many operations on x1 will lose the strided size advantage: x1.ravel(), x1 + 1, etc. It's mainly reduction operations like mean and sum that produce a real space savings.
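The comparison described above can be sketched as follows, with x1 the strided view and x2 the expanded copy as in the answer (the input array and window size are illustrative assumptions):

```python
import numpy as np
from sys import getsizeof
from numpy.lib.stride_tricks import as_strided

X = np.arange(30, dtype=np.float32).reshape(10, 3)
w = 2
n = X.shape[0] - w

# x2: the expanded copy, as the loop-based time_framer builds it
x2 = np.empty((n, w, X.shape[1]), dtype=np.float32)
for i in range(n):
    x2[i] = X[i:i + w]

# x1: the strided view; no data is copied
x1 = as_strided(X, shape=(n, w, X.shape[1]),
                strides=(X.strides[0], X.strides[0], X.strides[1]))

print(np.array_equal(x1, x2))             # True: same values
print(getsizeof(x1) < getsizeof(x2))      # True: x1 is just an array header
print(x1.astype(np.float32).strides == x2.strides)  # True: astype made a copy
```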