I am working on a project on Info Retrieval.
I have made a Full Inverted Index using Hadoop/Python.
Hadoop outputs the index as (word,documentlist) pairs which are written on the file.
For a quick access, I have created a dictionary(hashtable) using the above file.
My question is, how do I store such an index on disk that also has quick access time.
At present I am storing the dictionary using python pickle module and loading from it
but it brings the whole of index into memory at once (or does it?).
Please suggest an efficient way of storing and searching through the index.
My dictionary structure is as follows (using nested dictionaries)
{word : {doc1:[locations], doc2:[locations], ....}}
so that I can get the documents containing a word by
dictionary[word].keys() ... and so on.
shelve
At present I am storing the dictionary using python pickle module and loading from it but it brings the whole of index into memory at once (or does it?).
Yes it does bring it all in.
Is that a problem? If it's not an actual problem, then stick with it.
If it's a problem, what kind of problem do you have? Too slow? Too fast? Too colorful? Too much memory used? What problem do you have?
I would use Lucene. Why reinvent the wheel?
Just store it in a string like this:
<entry1>,<entry2>,<entry3>,...,<entryN>
If <entry*>
contains ',' character, use some other delimiter like '\t'.
This is smaller in size than an equivalent pickled string.
If you want to load it, just do:
L = s.split(delimiter)
You could store the repr() of the dictionary and use that to re-create it.
If it's taking a long time to load or using too much memory, you might need a database. There are many you might use; I would probably start with SQLite. Then your problem is "reduced" ;-) to simply formulating the right query to get what you need out of the database. This way you will only load what you need.
I am using anydmb for that purpose. Anydbm provides the same dictionary-like interface, except it allow only strings as keys and values. But this is not a constraint since you can use cPickle's loads/dumps to store more complex structures in the index.