Trendline plotting not working with big dataset

Posted 2019-05-26 02:28

Question:

I have a big dataset with 52166 data points, which looks like this:

                     bc_conc    
2010-04-09 10:00:00  609.542000          
2010-04-09 11:00:00  663.500000          
2010-04-09 12:00:00  524.661667         
2010-04-09 13:00:00  228.706667           
2010-04-09 14:00:00  279.721667         

It is a pandas DataFrame with a datetime index. I would like to plot bc_conc against time and add a trendline.

I used the following code:

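import matplotlib.dates
import matplotlib.pyplot as plt
import numpy as np

# data: the hourly DataFrame shown above
data = data.resample('M', closed='left', label='left').mean()  # monthly means
x1 = data.index
x2 = matplotlib.dates.date2num(data.index.to_pydatetime())  # dates as floats for the fit
y = data.bc_conc
z = np.polyfit(x2, y, 1)  # linear fit
p = np.poly1d(z)
fig = plt.figure()
ax1 = fig.add_subplot(1, 1, 1)
plt.plot_date(x=x1, y=y, fmt='b-')  # data
plt.plot(x1, p(x2), 'ro')  # trendline
plt.show()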

However, as you can see, I resampled my data. I did this because if I don't, the code just gives me a plot of the data without the trendline. If I resample to days, the plot still has no trendline. If I resample to months, a trendline shows.

It seems as if the code only works for a smaller dataset. Why is this? I was wondering if anyone could explain this to me, because I would like to resample my data to days, but no further.

Thanks in advance

Answer 1:

This code seems to work fine, whether using hourly or daily resampled data.

Starting with 100,000 data points:

from datetime import datetime
import pandas as pd

y = np.arange(0, 1000, .01) + np.random.normal(0, 100, 100000)
data = pd.DataFrame(data={'bc_conc': y}, index=pd.date_range(freq='H', start=datetime(2000, 1, 1), periods=len(y)))

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100000 entries, 2000-01-01 00:00:00 to 2011-05-29 15:00:00
Freq: H
Data columns (total 1 columns):
bc_conc    100000 non-null float64
dtypes: float64(1)

                        bc_conc
2000-01-01 00:00:00  -30.639811
2000-01-01 01:00:00  -26.791396
2000-01-01 02:00:00 -121.542718
2000-01-01 03:00:00  -69.267944
2000-01-01 04:00:00  117.731532

Calculation of trendline with optional resampling:

data = data.resample('D', closed='left', label='left').mean() # optional for daily data
x2 = matplotlib.dates.date2num(data.index.to_pydatetime()) # Dates to float representing (fraction of) days since 0001-01-01 00:00:00 UTC plus one

[ 730120.  730121.  730122. ...,  734284.  734285.  734286.]
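
A note on these numbers: the absolute values depend on matplotlib's date epoch. The comment above describes the 0001-01-01 convention; to my knowledge, matplotlib 3.3 and later default to 1970-01-01 instead, so the same dates may print as much smaller numbers. The spacing is in days either way, so the fitted slope is unaffected. A minimal sketch of that, using three made-up timestamps that are not from the original:

```python
from datetime import datetime, timedelta

import matplotlib.dates as mdates

# Hypothetical timestamps 12 hours apart: 0 h, 12 h, 24 h after the start.
start = datetime(2000, 1, 1)
stamps = [start + timedelta(hours=12 * i) for i in range(3)]

nums = mdates.date2num(stamps)  # floats in days since matplotlib's epoch

# Whatever the epoch, neighbouring values differ by half a day.
steps = nums[1:] - nums[:-1]
print(steps)
```

Only the intercept of the fit changes with the epoch, because a different epoch shifts every x value by the same constant.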

z = np.polyfit(x2, data.bc_conc, 1)

[  2.39988999e-01  -1.75220741e+05]  # coefficients

p = np.poly1d(z)

0.24 x - 1.752e+05 # fitted polynomial
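
As a sanity check on the slope: the synthetic series rises by 0.01 per hourly sample, which is 0.24 per day, and that is what the fit reports. A minimal sketch of the same check, assuming noise-free data over a hypothetical 100-day span (neither is in the original) and measuring time in days since the first sample instead of using date2num:

```python
import numpy as np
import pandas as pd

n_hours = 24 * 100  # assumption: 100 whole days of hourly samples
index = pd.date_range(start="2000-01-01", periods=n_hours,
                      freq=pd.Timedelta(hours=1))
# Same ramp as the answer (0.01 per hour), but without the random noise.
hourly = pd.DataFrame({"bc_conc": 0.01 * np.arange(n_hours)}, index=index)

daily = hourly.resample("D").mean()

# Days elapsed since the first daily bin, as plain floats.
days = np.asarray((daily.index - daily.index[0]) / pd.Timedelta(days=1),
                  dtype=float)

slope, intercept = np.polyfit(days, daily["bc_conc"].to_numpy(), 1)
print(slope)  # close to 0.24, i.e. 24 hourly steps of 0.01 per day
```

With the noise added back, as in the answer, the slope should only wobble around this value.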

data['trend'] = p(x2)  # trend from polynomial fit

              bc_conc     trend
2000-01-01 -29.794608  0.026983
2000-01-02   6.727729  0.266972
2000-01-03   9.815476  0.506961
2000-01-04 -27.954068  0.746950
2000-01-05 -13.726714  0.986939

data.plot()
plt.show()

Yields: a plot of bc_conc with the fitted trend line (image not preserved).
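
The synthetic data above has no gaps, which may be why it fits at every resolution while the data in the question does not. This is an assumption, since the question does not show whether bc_conc contains missing values. A day whose hourly readings are all NaN stays NaN after a daily mean, whereas a monthly mean usually still has some valid readings to average, and a least-squares fit on y values containing NaN does not give a usable line. A minimal sketch of fitting only the finite points, on made-up data with one missing day:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series; only the column name comes from the question.
index = pd.date_range(start="2010-04-09", periods=24 * 10,
                      freq=pd.Timedelta(hours=1))
data = pd.DataFrame({"bc_conc": 0.5 * np.arange(len(index), dtype=float)},
                    index=index)
data.iloc[48:72, 0] = np.nan  # one full day of missing readings

daily = data.resample("D").mean()  # the all-NaN day is still NaN here

# Days elapsed since the first daily bin, as plain floats.
days = np.asarray((daily.index - daily.index[0]) / pd.Timedelta(days=1),
                  dtype=float)
y = daily["bc_conc"].to_numpy()

# Fit on finite points only, then evaluate the line on every day.
mask = np.isfinite(y)
slope, intercept = np.polyfit(days[mask], y[mask], 1)
daily["trend"] = slope * days + intercept
```

Checking `data.bc_conc.isna().sum()` on the real data would show whether this applies. If it returns 0, the cause lies elsewhere.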