I have a Python datetime timestamp and a large dict (index) where keys are timestamps and the values are some other information I'm interested in.
I need to find the datetime (the key) in index that is closest to timestamp, as efficiently as possible.
At the moment I'm doing something like:
for timestamp in timestamps:
closestTimestamp = min(index,key=lambda datetime : abs(timestamp - datetime))
which works, but takes too long - my index dict has millions of values, and I'm doing the search thousands of times. I'm flexible with data structures and so on - the timestamps are roughly sequential, so that I'm iterating from the first to the last timestamps. Likewise the timestamps in the text file that I load into the dict are sequential.
Any ideas for optimisation would be greatly appreciated.
Dictionaries aren't organized for efficient near miss searches. They are designed for exact matches (using a hash table).
You may be better-off maintaining a separate, fast-searchable ordered structure.
A simple way to start off is to use the bisect module for fast O(log N) searches but slower O(n) insertions:
def nearest(ts):
# Given a presorted list of timestamps: s = sorted(index)
i = bisect_left(s, ts)
return min(s[max(0, i-1): i+2], key=lambda t: abs(ts - t))
A more sophisticated approach suitable for non-static, dynamically updated dicts, would be to use blist which employs a tree structure for fast O(log N) insertions and lookups. You only need this if the dict is going to change over time.
If you want to stay with a dictionary based approach, consider a dict-of-lists that clusters entries with nearby timestamps:
def get_closest_stamp(ts):
'Speed-up timestamp search by looking only at entries in the same hour'
hour = round_to_nearest_hour(ts)
cluster = daydict[hour] # return a list of entries
return min(cluster, key=lambda t: abs(ts - t))
Note, for exact results near cluster boundaries, store close-to-the-boundary timestamps in both the primary cluster and the adjacent cluster.
datetime objects are comparable to each other, so make a sorted list of your key/value pairs like this:
myPairs = list(dict.iteritems())
myPairs.sort()
For each element myPairs[i]
, myPairs[i][0]
is the datetime
key and myPairs[i][1]
is the value.
You can search this list efficiently using bisect_left
:
import bisect
i = bisect.bisect_left(myPairs, targetDatetime)
The element myPairs[i]
is the element with the lowest datetime no earlier than targetDatetime
. But the prior element (if there is one) might be closer in time to targetDatetime
. Or targetDatetime
might be later than any time in myPairs
. So you need to check:
if i > 0 and i == len(myPairs):
i -= 1
elif i > 0 and targetDatetime - myPairs[i-1][0] < myPairs[i][0]- targetDatetime:
i -= 1
If your list is truly sorted and not just "roughly sequential", you can use a binary search. Have a look at the bisect
module documentation for more information.