I have about 10,000 words used as a set of inverted indices to about 500,000 documents. Both are normalized so the index is a mapping of integers (word id) to a set of integers (ids of documents which contain the word).
My prototype uses Python's set as the obvious data type.
When I do a search for a document I find the list of N search words and their corresponding N sets. I want to return the set of documents in the intersection of those N sets.
Python's "intersect" method is implemented as a pairwise reduction. I think I can do better with a parallel search of sorted sets, so long as the library offers a fast way to get the next entry after i.
I've been looking for something like that for some time. Years ago I wrote PyJudy but I no longer maintain it and I know how much work it would take to get it to a stage where I'm comfortable with it again. I would rather use someone else's well-tested code, and I would like one which supports fast serialization/deserialization.
I can't find any, or at least not any with Python bindings. There is avltree which does what I want, but since even the pair-wise set merge take longer than I want, I suspect I want to have all my operations done in C/C++.
Do you know of any radix/patricia/critbit tree libraries written as C/C++ extensions for Python?
Failing that, what is the most appropriate library which I should wrap? The Judy Array site hasn't been updated in 6 years, with 1.0.5 released in May 2007. (Although it does build cleanly so perhaps It Just Works.)
(Edit: to clarify what I'm looking for from an API, I want something like:
def merge(document_sets):
probe_i = 0
probe_set = document_sets[probe_i]
document_id = GET_FIRST(probe_set)
while IS_VALID(document_id):
# See if the document is present in all sets
for i in range(1, len(document_sets)):
# dynamically adapt to favor the least matching set
target_i = (i + probe_i) % len(document_sets)
target = document_sets[target_i]
if document_id not in target_set:
probe_i = target_id
probe_set = document_sets[probe_i]
document_id = GET_NEXT(probe_set, document_id)
break
else:
yield document_id
I'm looking for something which implements GET_NEXT() to return the next entry which occurs after the given entry. This corresponds to Judy1N and the similar entries for other Judy arrays.
This algorithm dynamically adapts to the data should preferentially favor sets with low hits. For the type of data I work with this has given a 5-10% increase in performance.) )
I've recently added iteration support to datrie, you may give it a try.
Yes, there are some,
though I'm not sure if they're suitable for your use case:but it seems none of them are what you asked for.BioPython has a Trie implementation in C.
Ah, here's a nice discussion including benchmarks: http://bugs.python.org/issue9520
Other (some very stale) implementations:
http://pypi.python.org/pypi/radix
https://bitbucket.org/markon/patricia-tree/src
http://pypi.python.org/pypi/trie
http://pypi.python.org/pypi/logilab-common/0.50.3