I have a large dataset with 52166 data points, which looks like this:
bc_conc
2010-04-09 10:00:00 609.542000
2010-04-09 11:00:00 663.500000
2010-04-09 12:00:00 524.661667
2010-04-09 13:00:00 228.706667
2010-04-09 14:00:00 279.721667
It is a pandas DataFrame indexed by datetime. I would like to plot bc_conc against time and add a trendline.
I used the following code:
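import matplotlib.dates
import matplotlib.pyplot as plt
import numpy as np
# average the data to monthly values
data = data.resample('M', closed='left', label='left').mean()
x1 = data.index
# dates as floats, so they can be used in the fit
x2 = matplotlib.dates.date2num(data.index.to_pydatetime())
y = data.bc_conc
# first-degree (linear) least-squares fit
z = np.polyfit(x2, y, 1)
p = np.poly1d(z)
fig = plt.figure()
ax1 = fig.add_subplot(1, 1, 1)
plt.plot_date(x=x1, y=y, fmt='b-') # data
plt.plot(x1, p(x2), 'ro') # trendline
plt.show()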
However, as you can see, I resampled my data. I did this because if I don't, the code just gives me a plot of the data without the trendline. If I resample to days, the plot still has no trendline. If I resample to months, a trendline shows.
It seems as if the code only works for a smaller dataset. Why is this? I was wondering if anyone could explain this to me, because I would like to resample my data to days, but no further.
Thanks in advance
This code seems to work fine, whether using hourly or daily resampled data.
Starting with 100,000 data points:
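from datetime import datetime
import matplotlib.dates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
y = np.arange(0, 1000, .01) + np.random.normal(0, 100, 100000)
data = pd.DataFrame(data={'bc_conc': y}, index=pd.date_range(freq='H', start=datetime(2000, 1, 1), periods=len(y)))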
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100000 entries, 2000-01-01 00:00:00 to 2011-05-29 15:00:00
Freq: H
Data columns (total 1 columns):
bc_conc 100000 non-null float64
dtypes: float64(1)
bc_conc
2000-01-01 00:00:00 -30.639811
2000-01-01 01:00:00 -26.791396
2000-01-01 02:00:00 -121.542718
2000-01-01 03:00:00 -69.267944
2000-01-01 04:00:00 117.731532
Calculation of trendline with optional resampling:
data = data.resample('D', closed='left', label='left').mean() # optional for daily data
x2 = matplotlib.dates.date2num(data.index.to_pydatetime()) # dates to floats: days (including fractions) since 0001-01-01 00:00:00 UTC, plus one (the epoch in matplotlib before 3.3; newer versions default to 1970-01-01)
[ 730120. 730121. 730122. ..., 734284. 734285. 734286.]
z = np.polyfit(x2, data.bc_conc, 1)
[ 2.39988999e-01 -1.75220741e+05] # coefficients
p = np.poly1d(z)
0.24 x - 1.752e+05 # fitted polynomial
data['trend'] = p(x2) # trend from polynomial fit
bc_conc trend
2000-01-01 -29.794608 0.026983
2000-01-02 6.727729 0.266972
2000-01-03 9.815476 0.506961
2000-01-04 -27.954068 0.746950
2000-01-05 -13.726714 0.986939
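As a quick sanity check on the fitted slope: the synthetic series rises by 0.01 per hourly sample, so about 0.24 per day, which agrees with the first coefficient above. The sketch below repeats the fit in a self-contained way. It is not the exact code above: the fixed seed, the variable names, and the use of "days since the first timestamp" instead of date2num are my own assumptions, chosen so the check does not depend on the matplotlib epoch.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data: linear ramp of 0.01 per hour plus Gaussian noise.
# The seed is an arbitrary choice, only there to make the check repeatable.
rng = np.random.default_rng(0)
n = 100_000
y = np.arange(n) * 0.01 + rng.normal(0, 100, n)
index = pd.date_range(start='2000-01-01', periods=n, freq=pd.Timedelta(hours=1))

# Resample to daily means, as in the answer above.
daily = pd.Series(y, index=index, name='bc_conc').resample('D').mean()

# x axis in days since the first timestamp (small numbers, epoch independent).
x_days = np.asarray((daily.index - daily.index[0]) / pd.Timedelta(days=1), dtype=float)

# Linear least-squares fit: slope is in bc_conc units per day.
slope, intercept = np.polyfit(x_days, daily.to_numpy(), 1)

expected_slope = 0.01 * 24  # 0.01 per hour times 24 hours per day
print(f"fitted slope: {slope:.4f} per day, expected about {expected_slope:.2f}")
```

Shifting the x values changes only the intercept, not the slope, so the slope here is directly comparable to the one obtained with date2num.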
data.plot()
plt.show()
This yields a plot of the daily bc_conc values together with the fitted trend line.
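One thing worth checking with the real data, and this is an assumption on my part since the question does not show any gaps: np.polyfit does not skip missing values, so any NaN in bc_conc (for example hours or days with no measurements) would need to be dropped before fitting. A minimal sketch, where fit_trend is a hypothetical helper name and not part of pandas or numpy:

```python
import numpy as np
import pandas as pd

def fit_trend(series, deg=1):
    """Fit a polynomial trend to a datetime-indexed series, ignoring NaNs.

    Hypothetical helper for illustration; returns the trend evaluated
    at every timestamp of the input, including those that were NaN.
    """
    # Days elapsed since the first timestamp, as floats.
    x = np.asarray((series.index - series.index[0]) / pd.Timedelta(days=1), dtype=float)
    y = series.to_numpy(dtype=float)
    # Keep only finite values for the fit.
    mask = np.isfinite(y)
    coeffs = np.polyfit(x[mask], y[mask], deg)
    return pd.Series(np.polyval(coeffs, x), index=series.index, name='trend')

# Example: exactly linear daily data with a deliberate two-day gap.
idx = pd.date_range('2010-04-09', periods=10, freq='D')
s = pd.Series(np.arange(10, dtype=float) * 2.0 + 5.0, index=idx)
s.iloc[3:5] = np.nan
trend = fit_trend(s)
print(trend)
```

With the real data this would be used as data['trend'] = fit_trend(data.bc_conc), followed by data.plot() as above. Running data.bc_conc.isna().sum() first shows whether there are any gaps at all.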