I'm reading some automated weather data from the web. The observations occur every 5 minutes and are compiled into monthly files for each weather station. Once I'm done parsing a file, the DataFrame looks something like this:
Sta Precip1hr Precip5min Temp DewPnt WindSpd WindDir AtmPress
Date
2001-01-01 00:00:00 KPDX 0 0 4 3 0 0 30.31
2001-01-01 00:05:00 KPDX 0 0 4 3 0 0 30.30
2001-01-01 00:10:00 KPDX 0 0 4 3 4 80 30.30
2001-01-01 00:15:00 KPDX 0 0 3 2 5 90 30.30
2001-01-01 00:20:00 KPDX 0 0 3 2 10 110 30.28
The problem I'm having is that sometimes a scientist goes back and corrects observations -- not by editing the erroneous rows, but by appending a duplicate row to the end of a file. A simple example of such a case is illustrated below:
import pandas
import datetime

startdate = datetime.datetime(2001, 1, 1, 0, 0)
enddate = datetime.datetime(2001, 1, 1, 5, 0)
# pandas.date_range replaces the long-removed DatetimeIndex(start=..., end=...)
# constructor; 'h' is the current spelling of the hourly frequency alias.
index = pandas.date_range(start=startdate, end=enddate, freq='h')

data1 = {'A': range(6), 'B': range(6)}
data2 = {'A': [20, -30, 40], 'B': [-50, 60, -70]}
df1 = pandas.DataFrame(data=data1, index=index)
df2 = pandas.DataFrame(data=data2, index=index[:3])

# pandas.concat replaces the removed DataFrame.append: the erroneous df2 rows
# come first, with the corrected df1 rows appended after them.
df3 = pandas.concat([df2, df1])
df3
                      A   B
2001-01-01 00:00:00  20 -50
2001-01-01 01:00:00 -30  60
2001-01-01 02:00:00  40 -70
2001-01-01 00:00:00   0   0
2001-01-01 01:00:00   1   1
2001-01-01 02:00:00   2   2
2001-01-01 03:00:00   3   3
2001-01-01 04:00:00   4   4
2001-01-01 05:00:00   5   5
And so I need df3 to eventually become:
A B
2001-01-01 00:00:00 0 0
2001-01-01 01:00:00 1 1
2001-01-01 02:00:00 2 2
2001-01-01 03:00:00 3 3
2001-01-01 04:00:00 4 4
2001-01-01 05:00:00 5 5
I thought that adding a column of row numbers (df3['rownum'] = range(df3.shape[0])) would help me select out the bottom-most row for any value of the DatetimeIndex, but I am stuck on figuring out the groupby or pivot (or ???) statements to make that work.
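For concreteness, here's a rough, untested sketch of the kind of thing I'm imagining (keep the highest row number seen for each timestamp):

df3['rownum'] = range(df3.shape[0])
# For each timestamp, keep only the row whose rownum is the per-group maximum,
# i.e. the bottom-most (most recently appended) observation.
keep = df3.groupby(level=0)['rownum'].transform('max') == df3['rownum']
df3 = df3[keep.to_numpy()].drop(columns='rownum').sort_index()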
If anyone, like me, likes chainable data manipulation using the pandas dot notation (like piping), then the following may be useful:
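A minimal sketch of such a helper (the function name and its keep parameter are my own; it simply wraps the index.duplicated trick shown further below in a chainable form):

import pandas as pd

def drop_duplicate_index(df, keep='first'):
    # Keep only one row (first or last) for each repeated index label.
    return df[~df.index.duplicated(keep=keep)]

# Attach it to DataFrame so it can be called with dot notation.
pd.DataFrame.drop_duplicate_index = drop_duplicate_index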
This enables chaining statements like this:
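For example (a hypothetical pipeline, assuming the helper above has been defined):

df3.drop_duplicate_index(keep='last').sort_index()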
Unfortunately, I don't think Pandas allows one to drop dups off the indices. I would suggest the following:
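Presumably something along these lines (a sketch: groupby collapses the duplicated index labels, and last() keeps the appended correction for each timestamp):

# Group rows by their index values; for each duplicated timestamp,
# keep the last row seen (the appended correction).
df3 = df3.groupby(df3.index).last()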
Note: there is a better answer (below) based on the latest Pandas; it should be the accepted answer. My original answer, which is now outdated, is kept for reference.
A simple solution is to use drop_duplicates:
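With current pandas the call would look something like this (the older take_last=True spelling of this option has since been renamed to keep='last'):

# Drop rows whose 'rownum' value repeats, keeping the last occurrence.
df4 = df3.drop_duplicates(subset='rownum', keep='last')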
For me, this operated quickly on large data sets.
This requires that 'rownum' be the column with duplicates. In the modified example, 'rownum' has no duplicates, therefore nothing gets eliminated. What we really want is to be able to set cols (now called subset) to the index. I've not found a way to tell drop_duplicates to consider only the index.
Here is a solution that adds the index as a dataframe column, drops duplicates on that, then removes the new column:
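A sketch of that round-trip (assuming the index is unnamed, reset_index stores it in a column called 'index'):

df3 = (df3.reset_index()
          .drop_duplicates(subset='index', keep='last')
          .set_index('index'))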
And if you want things back in the proper order, just call sort_index on the dataframe.

I would suggest using the duplicated method on the Pandas Index itself:
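In code (keep='first' retains the first occurrence of each label; ~ inverts the boolean mask):

df3 = df3[~df3.index.duplicated(keep='first')]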
While all the other methods work, the currently accepted answer is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.
Using the sample data provided:
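A sketch of how the comparison might be run in IPython (timings omitted here, since they vary by machine and pandas version):

%timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
%timeit df3.groupby(df3.index).first()
%timeit df3[~df3.index.duplicated(keep='first')]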
Note that you can keep the last element by changing the keep argument.
It should also be noted that this method works with MultiIndex as well (using df1 as specified in Paul's example):
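Since Paul's df1 isn't reproduced here, a small stand-in frame with a duplicated MultiIndex entry can illustrate the point:

import pandas as pd

arrays = [['a', 'a', 'b', 'b'], [1, 1, 2, 3]]
df1 = pd.DataFrame({'val': range(4)},
                   index=pd.MultiIndex.from_arrays(arrays, names=['key', 'num']))

# index.duplicated works on a MultiIndex exactly as it does on a flat index.
df1[~df1.index.duplicated(keep='first')]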
Follow-up edit 2013-10-29: In the case where I have a fairly complex MultiIndex, I think I prefer the groupby approach. Here's a simple example for posterity, with the important part called out in a comment:
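Since the original frames aren't reproduced here, the sketch below builds an assumed stand-in with a duplicated MultiIndex:

import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('KPDX', '2001-01-01'), ('KPDX', '2001-01-01'), ('KSEA', '2001-01-02')],
    names=['station', 'date'])
df = pd.DataFrame({'Temp': [4, 3, 7]}, index=idx)

# And here's the important part: group on every index level, then keep the
# last row observed for each unique label combination.
deduped = df.groupby(level=df.index.names).last()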