Python - Locating the closest timestamp

2020-06-12 04:28发布

问题:

I have a Python datetime timestamp and a large dict (index) where keys are timestamps and the values are some other information I'm interested in.

I need to find the datetime (the key) in index that is closest to timestamp, as efficiently as possible.

At the moment I'm doing something like:

for timestamp in timestamps:
    closestTimestamp = min(index,key=lambda datetime : abs(timestamp - datetime))

which works, but takes too long - my index dict has millions of values, and I'm doing the search thousands of times. I'm flexible with data structures and so on - the timestamps are roughly sequential, so that I'm iterating from the first to the last timestamps. Likewise the timestamps in the text file that I load into the dict are sequential.

Any ideas for optimisation would be greatly appreciated.

回答1:

Dictionaries aren't organized for efficient near miss searches. They are designed for exact matches (using a hash table).

You may be better-off maintaining a separate, fast-searchable ordered structure.

A simple way to start off is to use the bisect module for fast O(log N) searches but slower O(n) insertions:

def nearest(ts):
    # Given a presorted list of timestamps:  s = sorted(index)
    i = bisect_left(s, ts)
    return min(s[max(0, i-1): i+2], key=lambda t: abs(ts - t))

A more sophisticated approach suitable for non-static, dynamically updated dicts, would be to use blist which employs a tree structure for fast O(log N) insertions and lookups. You only need this if the dict is going to change over time.

If you want to stay with a dictionary based approach, consider a dict-of-lists that clusters entries with nearby timestamps:

 def get_closest_stamp(ts):
      'Speed-up timestamp search by looking only at entries in the same hour'
      hour = round_to_nearest_hour(ts)
      cluster = daydict[hour]         # return a list of entries
      return min(cluster, key=lambda t: abs(ts - t))

Note, for exact results near cluster boundaries, store close-to-the-boundary timestamps in both the primary cluster and the adjacent cluster.



回答2:

datetime objects are comparable to each other, so make a sorted list of your key/value pairs like this:

myPairs = list(dict.iteritems())
myPairs.sort()

For each element myPairs[i], myPairs[i][0] is the datetime key and myPairs[i][1] is the value.

You can search this list efficiently using bisect_left:

import bisect
i = bisect.bisect_left(myPairs, targetDatetime)

The element myPairs[i] is the element with the lowest datetime no earlier than targetDatetime. But the prior element (if there is one) might be closer in time to targetDatetime. Or targetDatetime might be later than any time in myPairs. So you need to check:

if i > 0 and i == len(myPairs):
    i -= 1
elif i > 0 and targetDatetime - myPairs[i-1][0] < myPairs[i][0]- targetDatetime:
    i -= 1


回答3:

If your list is truly sorted and not just "roughly sequential", you can use a binary search. Have a look at the bisect module documentation for more information.