I am creating a DataFrame from a csv as follows:
stock = pd.read_csv('data_in/' + filename + '.csv', skipinitialspace=True)
The DataFrame has a date column. Is there a way to create a new DataFrame (or just overwrite the existing one) which only contains rows with date values that fall within a specified date range or between two specified date values?
There are two possible solutions:
- Use a boolean mask, then use df.loc[mask]
- Set the date column as a DatetimeIndex, then use df[start_date : end_date]
Using a boolean mask:
Ensure df['date'] is a Series with dtype datetime64[ns]:
df['date'] = pd.to_datetime(df['date'])
Make a boolean mask. start_date and end_date can be datetime.datetimes, np.datetime64s, pd.Timestamps, or even datetime strings:
mask = (df['date'] > start_date) & (df['date'] <= end_date)
Select the sub-DataFrame:
df.loc[mask]
or re-assign to df:
df = df.loc[mask]
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
print(df.loc[mask])
yields
0 1 2 date
153 0.208875 0.727656 0.037787 2000-06-02
154 0.750800 0.776498 0.237716 2000-06-03
155 0.812008 0.127338 0.397240 2000-06-04
156 0.639937 0.207359 0.533527 2000-06-05
157 0.416998 0.845658 0.872826 2000-06-06
158 0.440069 0.338690 0.847545 2000-06-07
159 0.202354 0.624833 0.740254 2000-06-08
160 0.465746 0.080888 0.155452 2000-06-09
161 0.858232 0.190321 0.432574 2000-06-10
Using a DatetimeIndex:
If you are going to do a lot of selections by date, it may be quicker to set the date column as the index first. Then you can select rows by date using df.loc[start_date:end_date].
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2000-6-1':'2000-6-10'])
yields
0 1 2
date
2000-06-01 0.040457 0.326594 0.492136 # <- includes start_date
2000-06-02 0.279323 0.877446 0.464523
2000-06-03 0.328068 0.837669 0.608559
2000-06-04 0.107959 0.678297 0.517435
2000-06-05 0.131555 0.418380 0.025725
2000-06-06 0.999961 0.619517 0.206108
2000-06-07 0.129270 0.024533 0.154769
2000-06-08 0.441010 0.741781 0.470402
2000-06-09 0.682101 0.375660 0.009916
2000-06-10 0.754488 0.352293 0.339337
Python list indexing, e.g. seq[start:end], includes start but not end. In contrast, pandas df.loc[start_date : end_date] includes both end-points in the result if they are in the index. Neither start_date nor end_date has to be in the index, however, as long as the index is sorted.
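To illustrate the point about end-points, here is a minimal sketch with made-up data that has a row only every other day. The column name value and the dates are assumptions chosen for illustration, and the index is assumed to be sorted.

```python
import pandas as pd

# Hypothetical data: one row every other day (Jan 1, 3, 5, ..., 19)
idx = pd.date_range('2000-01-01', periods=10, freq='2D')
df = pd.DataFrame({'value': range(10)}, index=idx)

# Neither 2000-01-04 nor 2000-01-10 is present in the index,
# yet the slice still selects the rows that fall between them.
subset = df.loc['2000-01-04':'2000-01-10']
print(subset)

# When the limits are present in the index, both are included.
inclusive = df.loc['2000-01-05':'2000-01-09']
print(inclusive)
```

Both slices select the rows for January 5, 7 and 9.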
Also note that pd.read_csv has a parse_dates parameter which you could use to parse the date column as datetime64s. Thus, if you use parse_dates, you would not need to use df['date'] = pd.to_datetime(df['date']).
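As a rough sketch of the parse_dates route, the snippet below reads a small in-memory CSV instead of the file from the question. The CSV contents and the close column are invented for illustration; only the date column name comes from the question.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real file on disk
csv_text = (
    "date,close\n"
    "2000-06-01,10.0\n"
    "2000-06-05,11.5\n"
    "2000-06-12,12.25\n"
)

# parse_dates converts the 'date' column while reading,
# so no separate pd.to_datetime call is needed afterwards.
stock = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True,
                    parse_dates=['date'])
print(stock['date'].dtype)  # a datetime64 dtype

# The same boolean mask as above now works directly
mask = (stock['date'] > '2000-06-01') & (stock['date'] <= '2000-06-10')
in_range = stock.loc[mask]
print(in_range)
```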
I feel the best option is to use direct comparisons rather than the loc function:
df = df[(df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')]
It works for me.
A major issue with slicing via loc is that the limits may need to be present in the index: on an unsorted index, a missing limit results in a KeyError.
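You can use the isin method on the date column like so:
df[df["date"].isin(pd.date_range(start_date, end_date))]
Note: This only works with dates (as the question asks) and not timestamps, because pd.date_range(start_date, end_date) yields one value per day, so rows with a different time of day are not matched.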
Example:
import numpy as np
import pandas as pd
# Make a DataFrame with dates and random numbers
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')
# Select the rows between two dates
in_range_df = df[df["date"].isin(pd.date_range("2017-01-15", "2017-01-20"))]
print(in_range_df) # print result
which gives
0 1 2 date
14 0.960974 0.144271 0.839593 2017-01-15
15 0.814376 0.723757 0.047840 2017-01-16
16 0.911854 0.123130 0.120995 2017-01-17
17 0.505804 0.416935 0.928514 2017-01-18
18 0.204869 0.708258 0.170792 2017-01-19
19 0.014389 0.214510 0.045201 2017-01-20
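Regarding the note about timestamps: if the column does carry a time of day, one possible workaround is to normalize each value to midnight before calling isin. This is only a sketch with invented data (the value column and the timestamps are assumptions), not part of the original answer.

```python
import pandas as pd

# Hypothetical data: timestamps with a time-of-day component
df = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-15 09:30',
                            '2017-01-16 00:00',
                            '2017-01-21 12:00']),
    'value': [1, 2, 3],
})
days = pd.date_range('2017-01-15', '2017-01-20')

# Plain isin only matches rows that fall exactly on midnight
exact = df[df['date'].isin(days)]
print(exact)

# Normalizing to midnight first compares by calendar day instead
by_day = df[df['date'].dt.normalize().isin(days)]
print(by_day)
```

Here the plain isin keeps only the midnight row, while the normalized version also keeps the 09:30 row on January 15.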
You can also use between:
df[df.some_date.between(start_date, end_date)]
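A small self-contained sketch of between, using invented data (the value column and the dates are assumptions). By default both limits are included; the string form of the inclusive argument shown here requires pandas 1.3 or later.

```python
import pandas as pd

# Hypothetical data: ten consecutive days
df = pd.DataFrame({
    'some_date': pd.date_range('2017-01-01', periods=10, freq='D'),
    'value': range(10),
})
start_date, end_date = '2017-01-03', '2017-01-05'

# Default: both end-points are included
both = df[df.some_date.between(start_date, end_date)]
print(both)

# Equivalent explicit boolean mask
mask = (df['some_date'] >= start_date) & (df['some_date'] <= end_date)

# Exclude the start, include the end (pandas 1.3+)
right_only = df[df.some_date.between(start_date, end_date, inclusive='right')]
print(right_only)
```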
If you are going to do this frequently, the best solution is to first set the date column as the index, which gives you a DatetimeIndex (provided the column already holds datetime values), and then use the following condition to slice any range of dates.
import pandas as pd
# data_frame is assumed to already exist and to have a datetime 'date' column
data_frame = data_frame.set_index('date')
df = data_frame[(data_frame.index > '2017-08-10') & (data_frame.index <= '2017-08-15')]
I prefer not to alter the df.
An option is to retrieve the index of the start and end dates:
import numpy as np
import pandas as pd
# Dummy DataFrame
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')
# Get the index labels of the start and end dates respectively
start = df[df['date'] == '2017-01-07'].index[0]
end = df[df['date'] == '2017-01-14'].index[0]
# Show the sliced df (from 2017-01-07 to 2017-01-14)
print(df.loc[start:end])
which results in:
0 1 2 date
6 0.5 0.8 0.8 2017-01-07
7 0.0 0.7 0.3 2017-01-08
8 0.8 0.9 0.0 2017-01-09
9 0.0 0.2 1.0 2017-01-10
10 0.6 0.1 0.9 2017-01-11
11 0.5 0.3 0.9 2017-01-12
12 0.5 0.4 0.3 2017-01-13
13 0.4 0.9 0.9 2017-01-14