I have a dataset of time-series examples. I want to calculate the similarity between various time-series examples; however, I do not want to take into account differences due to scaling (i.e. I want to look at similarities in the shape of the time series, not their absolute values). So, to this end, I need a way of normalizing the data, that is, making all of the time-series examples fall within a certain range, e.g. [0, 100]. Can anyone tell me how this can be done in Python?
I'm not going to give the Python code, but the definition of normalizing is that for every value (data point) you calculate (value - mean) / stdev. Your values will not fall between 0 and 1 (or 0 and 100), but I don't think that's what you want. You want to compare the variation, which is what you are left with if you do this.
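For illustration, a minimal numpy sketch of that calculation (the array name `series` is just a placeholder):

```python
import numpy as np

series = np.array([3.0, 7.0, 5.0, 11.0, 9.0])  # an example time series

# z-score: subtract the mean, divide by the standard deviation
normalized = (series - series.mean()) / series.std()
```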
Assuming that your timeseries is an array, try something like this:
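A minimal min-max sketch, assuming `timeseries` is a 1-D numpy array:

```python
import numpy as np

timeseries = np.array([3.0, 7.0, 5.0, 11.0, 9.0])  # an example time series

# subtract the minimum and divide by the range
scaled = (timeseries - timeseries.min()) / (timeseries.max() - timeseries.min())
```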
This will confine your values between 0 and 1.
The solutions given are good for series that are neither incremental nor decremental (i.e. stationary). For financial time series (or any other series with a bias), the formula given is not right. The series should first be detrended, or the scaling should be based on the latest 100-200 samples.
And if the time series doesn't come from a normal distribution (as is the case in finance), it is advisable to apply a non-linear function (a standard CDF function, for example) to compress the outliers.
Aronson and Masters' book (Statistically Sound Machine Learning for Algorithmic Trading) uses the following formula (on 200-day chunks):
V = 100 * N(0.5 * (X - F50) / (F75 - F25)) - 50
Where:
X: data point
F50: mean of the latest 200 points
F75: 75th percentile
F25: 25th percentile
N: normal CDF
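A minimal sketch of that formula applied to a single 200-sample window, using scipy's normal CDF (the variable names are my own):

```python
import numpy as np
from scipy.stats import norm

window = np.random.randn(200).cumsum()  # one 200-sample chunk of the series

f50 = window.mean()                          # F50: mean of the window
f75, f25 = np.percentile(window, [75, 25])   # F75, F25: percentiles

# compress through the normal CDF and map to roughly [-50, 50]
v = 100.0 * norm.cdf(0.5 * (window - f50) / (f75 - f25)) - 50.0
```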
Following my previous comment, here is a (not optimized) Python function that does scaling and/or normalization. (It needs a pandas DataFrame as input, and it doesn't check for that, so it raises errors if supplied with another object type. If you need to use a list or numpy.array, you need to modify it, or convert those objects to a pandas.DataFrame() first.)
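A sketch of what such a function might look like, assuming it applies either plain min-max scaling or the compressed normalization above to each column over trailing 200-sample windows (the name `scale_normalize` and its parameters are my own):

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

def scale_normalize(df, window=200, mode="normalize"):
    """Scale and/or normalize every column of a DataFrame over a trailing window.

    mode="scale"     -> min-max scaling into [0, 1]
    mode="normalize" -> compressed normalization into roughly [-50, 50] (formula above)
    """
    if not isinstance(df, pd.DataFrame):
        raise TypeError("expected a pandas DataFrame")

    result = {}
    for col in df.columns:
        values = df[col].to_numpy(dtype=float)
        out = np.empty_like(values)
        for i in range(len(values)):
            # use up to the latest `window` samples ending at position i
            chunk = values[max(0, i - window + 1): i + 1]
            if mode == "scale":
                rng = chunk.max() - chunk.min()
                out[i] = (values[i] - chunk.min()) / rng if rng else 0.0
            else:
                f50 = chunk.mean()
                f75, f25 = np.percentile(chunk, [75, 25])
                iqr = f75 - f25
                out[i] = 100.0 * norm.cdf(0.5 * (values[i] - f50) / iqr) - 50.0 if iqr else 0.0
        result[col] = out
    return pd.DataFrame(result, index=df.index)
```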
This function is slow, so it's advisable to run it just once and store the results.
You can also take a look at normalize-standardize-time-series-data-python and sklearn.preprocessing.minmax_scale.
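For the sklearn route, a quick sketch with minmax_scale, which rescales to a chosen range in one call (the [0, 100] range here matches the question):

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

series = np.array([3.0, 7.0, 5.0, 11.0, 9.0])

# rescale the series into [0, 100]
scaled = minmax_scale(series, feature_range=(0, 100))
```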