Find closest row of DataFrame to given time in Pan

2020-02-04 02:47发布

I have a Pandas dataframe which is indexed by a DatetimeIndex:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 53732 entries, 1993-01-07 12:23:58 to 2012-12-02 20:06:23
Data columns:
Date(dd-mm-yy)_Time(hh-mm-ss)       53732  non-null values
Julian_Day                          53732  non-null values
AOT_870                             53732  non-null values
440-870Angstrom                     53732  non-null values
440-675Angstrom                     53732  non-null values
500-870Angstrom                     53732  non-null values
Last_Processing_Date(dd/mm/yyyy)    53732  non-null values
Solar_Zenith_Angle                  53732  non-null values
time                                53732  non-null values
dtypes: datetime64[ns](2), float64(6), object(1)

I want to find the row that is closest to a certain time:

image_time = dateutil.parser.parse('2009-07-28 13:39:02')

and find how close it is. So far, I have tried various things based upon the idea of subtracting the time I want from all of the times and finding the smallest absolute value, but none quite seem to work.

For example:

aeronet.index - image_time

Gives an error which I think is due to +/- on a Datetime index shifting things, so I tried putting the index into another column and then working on that:

aeronet['time'] = aeronet.index
aeronet.time - image_time

This seems to work, but to do what I want, I need to get the ABSOLUTE time difference, not the relative difference. However, just running abs or np.abs on it gives an error:

abs(aeronet.time - image_time)

C:\Python27\lib\site-packages\pandas\core\series.pyc in __repr__(self)
   1061         Yields Bytestring in Py2, Unicode String in py3.
   1062         """
-> 1063         return str(self)
   1064 
   1065     def _tidy_repr(self, max_vals=20):

C:\Python27\lib\site-packages\pandas\core\series.pyc in __str__(self)
   1021         if py3compat.PY3:
   1022             return self.__unicode__()
-> 1023         return self.__bytes__()
   1024 
   1025     def __bytes__(self):

C:\Python27\lib\site-packages\pandas\core\series.pyc in __bytes__(self)
   1031         """
   1032         encoding = com.get_option("display.encoding")
-> 1033         return self.__unicode__().encode(encoding, 'replace')
   1034 
   1035     def __unicode__(self):

C:\Python27\lib\site-packages\pandas\core\series.pyc in __unicode__(self)
   1044                     else get_option("display.max_rows"))
   1045         if len(self.index) > (max_rows or 1000):
-> 1046             result = self._tidy_repr(min(30, max_rows - 4))
   1047         elif len(self.index) > 0:
   1048             result = self._get_repr(print_header=True,

C:\Python27\lib\site-packages\pandas\core\series.pyc in _tidy_repr(self, max_vals)
   1069         """
   1070         num = max_vals // 2
-> 1071         head = self[:num]._get_repr(print_header=True, length=False,
   1072                                     name=False)
   1073         tail = self[-(max_vals - num):]._get_repr(print_header=False,

AttributeError: 'numpy.ndarray' object has no attribute '_get_repr'

Am I approaching this the right way? If so, how should I get abs to work, so that I can then select the minimum absolute time difference, and thus get the closest time. If not, what is the best way to do this with a Pandas time-series?

3条回答
再贱就再见
2楼-- · 2020-02-04 03:15

I was confronting the same problem today. I wanted a function able to give me the closest value prior a given timestamp. Here is the function I got:

def get_nearest_past(data, timestamp):
    index = data.index.get_loc(timestamp,"ffill")
    return data.iloc[index]

In the case that you need the global closest (and not the closest before as in my case), you can use:

def get_nearest(data, timestamp):
    index = data.index.get_loc(timestamp,"nearest")
    return data.iloc[index]

You can find more information in the get_loc documentation.

查看更多
The star\"
3楼-- · 2020-02-04 03:24

I think you can try DatetimeIndex.asof to find the most recent label up to and including the input. Then use the returned datetime to select the appropriate row. If you only need values for a particular column, Series.asof exists and combines the two steps above into one.

This assumes you want the closest datetime. If you don't care about the date and just want the same time every day, use at_time in DataFrame.

Follow up:

Edit: false alarm, I had an older version locally. The latest on master should work with np.abs.

In [10]: np.abs(df.time - image_time)
Out[10]: 
0    27 days, 13:39:02
1    26 days, 13:39:02
2    25 days, 13:39:02
3    24 days, 13:39:02
4    23 days, 13:39:02
5    22 days, 13:39:02

Also just to clarify:

aeronet.index - image_time doesn't work because subtraction on Index is a set difference (back in the day Index used to be constrained to be unique).

查看更多
啃猪蹄的小仙女
4楼-- · 2020-02-04 03:28

This simple method will return the (integer index of the) TimeSeriesIndex entry closest to a given datetime object. There's no need to copy the index to a regular column - simply use the .to_pydatetime method instead.

import numpy as np

i = np.argmin(np.abs(df.index.to_pydatetime() - image_time))

Then you simply use the DataFrame's .iloc indexer:

df.iloc[i]

Here's a function to do this:

def fcl(df, dtObj):
    return df.iloc[np.argmin(np.abs(df.index.to_pydatetime() - dtObj))]

You can then further filter seamlessly, e.g.

fcl(df, dtObj)['column']
查看更多
登录 后发表回答