Python - how to normalize time-series data

Posted 2020-07-10 03:10

I have a dataset of time-series examples. I want to calculate the similarity between various time-series examples, but I do not want to take into account differences due to scaling (i.e. I want to look at similarities in the shape of the time series, not their absolute values). To this end, I need a way of normalizing the data, that is, making all of the time-series examples fall within a certain range, e.g. [0, 100]. Can anyone tell me how this can be done in Python?

5 answers
姐就是有狂的资本
Reply #2 · 2020-07-10 03:15

I'm not going to give the Python code, but the usual definition of normalizing (more precisely, standardizing) is that for every value (data point) you calculate "(value - mean) / stdev". Your values will not fall between 0 and 1 (or 0 and 100), but I don't think that's what you want. You want to compare the variation, which is what you are left with if you do this.
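
For illustration, the "(value - mean) / stdev" calculation could look like the minimal sketch below. It assumes the series is a 1-D NumPy array; the function name `zscore` and the handling of a constant series are my own choices, not part of the answer above.

```python
import numpy as np

def zscore(series):
    """Standardize a 1-D series to zero mean and unit standard deviation."""
    x = np.asarray(series, dtype=float)
    std = x.std()
    if std == 0:
        # A constant series has no shape to compare; returning zeros instead
        # of dividing by zero is an assumption of this sketch.
        return np.zeros_like(x)
    return (x - x.mean()) / std

# Two series with the same shape but a different scale and offset.
a = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
b = 10.0 * a + 50.0

za, zb = zscore(a), zscore(b)
```

After standardization, `za` and `zb` are equal up to floating-point error, so a distance between them reflects only differences in shape.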

beautiful°
Reply #3 · 2020-07-10 03:21

Assuming that your timeseries is an array, try something like this:

    (timeseries - timeseries.min()) / (timeseries.max() - timeseries.min())

This will confine your values between 0 and 1. Multiply the result by 100 to get the [0, 100] range.
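
A slightly more general sketch of the same idea, assuming a 1-D NumPy array and a target range such as [0, 100]. The helper name `minmax_scale_series`, the `lo`/`hi` parameters and the treatment of a constant series are assumptions made for this example.

```python
import numpy as np

def minmax_scale_series(timeseries, lo=0.0, hi=100.0):
    """Linearly map a 1-D series onto the interval [lo, hi]."""
    x = np.asarray(timeseries, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # A constant series cannot be stretched; map it to lo
        # (an arbitrary choice for this sketch).
        return np.full_like(x, lo)
    return lo + (hi - lo) * (x - x.min()) / span

scaled = minmax_scale_series([3.0, 5.0, 4.0, 7.0])
# scaled -> [0., 50., 25., 100.]
```

Note that a single outlier determines the minimum or maximum, so this scaling is sensitive to extreme values.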

来,给爷笑一个
Reply #4 · 2020-07-10 03:22

The solutions given are good for series that are neither increasing nor decreasing (that is, stationary). For financial time series (or any other series with a bias), the formula given is not right. The series should first be detrended, or the scaling should be based on the latest 100-200 samples.
If the time series doesn't come from a normal distribution (as is the case in finance), it is advisable to apply a non-linear function (a standard CDF function, for example) to compress the outliers.
The Aronson and Masters book (Statistically Sound Machine Learning for Algorithmic Trading) uses the following formula (on 200-day chunks):

V = 100 * N(0.5 * (X - F50) / (F75 - F25)) - 50

Where:
X: data point
F50: median (50th percentile) of the latest 200 points
F75: 75th percentile of the latest 200 points
F25: 25th percentile of the latest 200 points
N: normal CDF
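
To make the formula concrete, here is a small sketch that applies it to one data point given a chunk of history. It assumes NumPy and SciPy are available; the function name `robust_compress` and the toy history are invented for illustration and are not taken from the book.

```python
import numpy as np
from scipy.stats import norm

def robust_compress(x, history):
    """Apply V = 100 * N(0.5 * (X - F50) / (F75 - F25)) - 50 to x."""
    f25, f50, f75 = np.percentile(history, [25, 50, 75])
    return 100.0 * norm.cdf(0.5 * (x - f50) / (f75 - f25)) - 50.0

# Toy history of 200 points (hypothetical data).
history = np.arange(1.0, 201.0)

v_median = robust_compress(np.median(history), history)   # a typical value
v_outlier = robust_compress(10_000.0, history)            # an extreme value
```

A value at the median maps to 0, while even a very large outlier stays at or below 50, which is the compression effect described above.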

冷血范
Reply #5 · 2020-07-10 03:22

Following my previous comment, here is a (not optimized) Python function that does scaling and/or normalization. It needs a pandas DataFrame as input and doesn't check that, so it raises errors if supplied with another object type. If you need to use a list or numpy.array, you need to modify it, or you could convert those objects with pandas.DataFrame() first.

This function is slow, so it's advisable to run it just once and store the results.

    from scipy.stats import norm
    import pandas as pd

    def get_NormArray(df, n, mode='total', linear=False):
        '''
        Normalize the first column of df using the median and quartiles of
        the n values preceding each point (modes: 'total' or 'scale'), using
        the formulas from the book "Statistically Sound Machine Learning..."
        (Aronson and Masters). The decision to apply the non-linear
        compression is left to the user.
        The formula is rescaled: the default mode (total, non-linear) returns
        values between -0.5 and 0.5 instead of -50 and 50.
        df: input DataFrame. A DataFrame is returned, but it could be a list.
        n: number of data points used for the median and the quartiles.
        mode: 'scale' scales without centering; 'total' centers and scales.
        '''
        if len(df) <= n:
            # Without this check F50, F75 and F25 would never be assigned.
            raise ValueError('df must contain more than n rows')

        x = df.iloc[:, 0]  # only the first column is normalized
        temp = []

        for i in reversed(range(len(x))):
            if i >= n:
                # Traveling statistics over the n points before i. The initial
                # n values (i < n) are normalized using the last computed
                # values of F50, F75 and F25.
                window = x.iloc[i-n:i]
                F50 = window.quantile(0.5)
                F75 = window.quantile(0.75)
                F25 = window.quantile(0.25)

            if linear and mode == 'total':
                v = 0.5 * ((x.iloc[i] - F50) / (F75 - F25)) - 0.5
            elif linear and mode == 'scale':
                v = 0.25 * x.iloc[i] / (F75 - F25) - 0.5
            elif not linear and mode == 'scale':
                v = 0.5 * norm.cdf(0.25 * x.iloc[i] / (F75 - F25)) - 0.5
            else:
                # Default, also used if strange values are given: full
                # normalization (center and scale) with CDF compression.
                v = norm.cdf(0.5 * (x.iloc[i] - F50) / (F75 - F25)) - 0.5

            temp.append(v)
        return pd.DataFrame(temp[::-1])
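
Since the function is slow, a vectorized variant may be worth trying. The following is only a sketch of the default mode (center, scale, CDF compression) built on pandas rolling quantiles. The name `rolling_robust_norm` and the random-walk `prices` series are made up for this example, and it takes a 1-D series rather than a DataFrame.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

def rolling_robust_norm(series, n):
    """Center, scale and CDF-compress each point using the n points before it."""
    s = pd.Series(series, dtype=float)
    # shift(1) makes the statistics at position i come from s[i-n:i],
    # so the current point is excluded from its own window.
    roll = s.shift(1).rolling(n)
    # The first n positions have no full window; bfill() gives them the
    # first available statistics, as the loop version does.
    f50 = roll.quantile(0.5).bfill()
    iqr = (roll.quantile(0.75) - roll.quantile(0.25)).bfill()
    values = norm.cdf(0.5 * (s - f50) / iqr) - 0.5
    return pd.Series(values, index=s.index)

# Hypothetical data: a random walk standing in for a price series.
rng = np.random.default_rng(0)
prices = pd.Series(rng.normal(size=300).cumsum())
out = rolling_robust_norm(prices, 50)
```

The result lies between -0.5 and 0.5. A window whose 25th and 75th percentiles coincide would cause a division by zero, which this sketch does not handle.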
倾城 Initia
Reply #6 · 2020-07-10 03:27

    from sklearn import preprocessing
    normalized_data = preprocessing.minmax_scale(data)

For more detail, see normalize-standardize-time-series-data-python and the documentation for sklearn.preprocessing.minmax_scale.
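
As a hedged usage sketch for the [0, 100] range asked about in the question: `minmax_scale` accepts `feature_range` and `axis` arguments. The `data` array below is hypothetical, with each row holding one time series.

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Hypothetical data: each row is one time series.
data = np.array([[1.0, 2.0, 3.0, 2.0],
                 [10.0, 30.0, 50.0, 30.0]])

# axis=1 scales each row (series) independently; the default axis=0
# would instead scale each column (time step) across the series.
normalized_data = minmax_scale(data, feature_range=(0, 100), axis=1)
```

Both rows become [0, 50, 100, 50], so series with the same shape end up identical regardless of their original scale.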
