I have a full inverted index in the form of a nested Python dictionary. Its structure is:
{word: {doc_name: [location_list]}}
For example, if the dictionary is called index, then the entry for the word 'spam' would look like:
{'spam': {'doc1.txt': [102, 300, 399], 'doc5.txt': [200, 587]}}
I used this structure because Python dicts are well optimised and it makes programming easier.
For any word 'spam', the documents containing it are given by:
index['spam'].keys()
and the posting list for a document doc1.txt by:
index['spam']['doc1.txt']
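To make the layout concrete, here is a minimal sketch of how such a nested dictionary could be built and queried. The helper name build_index, the sample documents, and the choice of word offsets as locations are assumptions made for illustration only.

```python
from collections import defaultdict


def build_index(docs):
    """Build {word: {doc_name: [positions]}} from {doc_name: text}.

    Positions are word offsets here, which is an assumption for
    illustration; the real index may store character offsets instead.
    """
    index = defaultdict(dict)
    for doc_name, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_name, []).append(position)
    return dict(index)


# Hypothetical sample documents.
docs = {
    "doc1.txt": "spam eggs spam",
    "doc5.txt": "ham spam",
}
index = build_index(docs)

# Documents containing 'spam', and the posting list for one of them.
docs_with_spam = sorted(index["spam"].keys())
postings = index["spam"]["doc1.txt"]
```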
At present I am using cPickle to store and load this dictionary. But the pickled file is around 380 MB and takes a long time to load: approximately 112 seconds (timed with time.time()), and memory usage goes up to 1.2 GB (according to GNOME System Monitor). Once it loads, it's fine. I have 4 GB of RAM.
len(index.keys())
gives 229758
Code
import cPickle as pickle

print 'Loading index... please wait...'
# Open in binary mode; the with block closes the file after loading.
with open('full_index', 'rb') as f:
    index = pickle.load(f)  # This takes ages
print 'Index loaded. You may now proceed to search'
How can I make it load faster? I only need to load it once, when the application starts. After that, the access time is important to respond to queries.
Should I switch to a database like SQLite and create an index on its keys? If so, how do I store the values to get an equivalent schema that makes retrieval easy? Is there anything else that I should look into?
Addendum
Using Tim's answer, pickle.dump(index, file, -1),
the pickled file is considerably smaller, around 237 MB (it took 300 seconds to dump), and it now takes about half the time to load (61 seconds, as opposed to 112 seconds earlier, timed with time.time()).
But should I migrate to a database for scalability?
For now I am marking Tim's answer as accepted.
PS: I don't want to use Lucene or Xapian. This question refers to "Storing an inverted index". I had to ask a new question because I wasn't able to delete the previous one.
Depending on how long 'long' is, you have to think about the trade-offs: either have all the data ready in memory after a (long) startup, or load only partial data (in which case you need to split the data into multiple files, or use SQLite or something similar). I doubt that loading all the data upfront from e.g. SQLite into a dictionary will bring any improvement.
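If you go the partial-loading route with SQLite, one possible layout is a single table with one row per (word, document, position) and an index on the word column. The sketch below is only an illustration under that assumption; the table name postings, the index name, and the helpers save_index and lookup are all hypothetical, and an in-memory database stands in for a real file.

```python
import sqlite3


def save_index(conn, index):
    """Flatten {word: {doc: [positions]}} into one row per position."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings ("
        " word TEXT NOT NULL,"
        " doc TEXT NOT NULL,"
        " position INTEGER NOT NULL)"
    )
    rows = (
        (word, doc, pos)
        for word, docs in index.items()
        for doc, positions in docs.items()
        for pos in positions
    )
    conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", rows)
    # The index on word is what keeps single-word lookups fast.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_postings_word ON postings (word)"
    )
    conn.commit()


def lookup(conn, word):
    """Rebuild {doc: [positions]} for one word without loading the rest."""
    result = {}
    cursor = conn.execute(
        "SELECT doc, position FROM postings"
        " WHERE word = ? ORDER BY doc, position",
        (word,),
    )
    for doc, pos in cursor:
        result.setdefault(doc, []).append(pos)
    return result


conn = sqlite3.connect(":memory:")  # use a file path for a persistent index
sample = {"spam": {"doc1.txt": [102, 300, 399], "doc5.txt": [200, 587]}}
save_index(conn, sample)
spam_postings = lookup(conn, "spam")
```

With this layout nothing is loaded at startup; each query pays for one indexed SELECT instead, which is the trade-off described above.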
A common pattern in Python 2.x is to have one version of a module implemented in pure Python, with an optional accelerated version implemented as a C extension; for example, pickle and cPickle. This places the burden of importing the accelerated version and falling back on the pure Python version on each user of these modules. In Python 3.0, the accelerated versions are considered implementation details of the pure Python versions. Users should always import the standard version, which attempts to import the accelerated version and falls back to the pure Python version. The pickle / cPickle pair received this treatment.
If your dictionary is huge and should only be compatible with Python 3.4 or higher, use:
or:
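A minimal sketch of two ways to do this, assuming what is meant is pickle protocol 4 (added in Python 3.4, with support for very large objects). The file location and sample data are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical sample data and output path.
index = {"spam": {"doc1.txt": [102, 300, 399], "doc5.txt": [200, 587]}}
path = os.path.join(tempfile.mkdtemp(), "full_index")

# Form 1: module-level function with an explicit protocol.
with open(path, "wb") as f:
    pickle.dump(index, f, protocol=4)

# Form 2: a Pickler object bound to the file, same protocol.
with open(path, "wb") as f:
    pickle.Pickler(f, 4).dump(index)

# Loading does not need the protocol; it is detected from the file.
with open(path, "rb") as f:
    restored = pickle.load(f)
```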
That said, in 2010 the json module was 25 times faster at encoding and 15 times faster at decoding simple types than pickle. My 2014 benchmark says marshal > pickle > json, but marshal's coupled to specific Python versions.
Have you tried using an alternative storage format such as YAML or JSON? Python supports JSON natively from Python 2.6 using the json module I think, and there are third party modules for YAML.
You may also try the shelve module.
Try the protocol argument when using cPickle.dump/cPickle.dumps. From cPickle.Pickler.__doc__:
Converting JSON or YAML will probably take longer than pickling most of the time - pickle stores native Python types.
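Since these answers disagree on which format is fastest, it is probably worth measuring on your own data. The following is a rough timing sketch, not a rigorous benchmark: the synthetic index from make_index and the helper time_round_trip are made up for illustration, and the numbers will vary with Python version and data shape.

```python
import json
import marshal
import pickle
import time


def make_index(n_words=2000, n_docs=5, n_pos=10):
    """Build a synthetic {word: {doc: [positions]}} index for timing."""
    return {
        "word%d" % w: {
            "doc%d.txt" % d: list(range(d, d + n_pos)) for d in range(n_docs)
        }
        for w in range(n_words)
    }


def time_round_trip(dumps, loads, data):
    """Time one serialize/deserialize cycle and check the result."""
    start = time.perf_counter()
    blob = dumps(data)
    middle = time.perf_counter()
    restored = loads(blob)
    end = time.perf_counter()
    return {
        "dump_s": middle - start,
        "load_s": end - middle,
        "size": len(blob),
        "ok": restored == data,
    }


# Each entry maps a label to a (dumps, loads) pair working on bytes.
codecs = {
    "pickle protocol 0": (lambda d: pickle.dumps(d, 0), pickle.loads),
    "pickle highest": (
        lambda d: pickle.dumps(d, pickle.HIGHEST_PROTOCOL),
        pickle.loads,
    ),
    "json": (
        lambda d: json.dumps(d).encode("utf-8"),
        lambda b: json.loads(b.decode("utf-8")),
    ),
    "marshal": (marshal.dumps, marshal.loads),
}

data = make_index()
results = {
    name: time_round_trip(dumps, loads, data)
    for name, (dumps, loads) in codecs.items()
}
for name, r in results.items():
    print("%-18s dump %.4fs  load %.4fs  %d bytes"
          % (name, r["dump_s"], r["load_s"], r["size"]))
```

Note that JSON only round-trips cleanly here because every key is a string and every value is a list of integers; other key types would come back changed.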
Do you really need it to load all at once? If you don't need all of it in memory, but only the select parts you want at any given time, you may want to map your dictionary to a set of files on disk instead of a single file… or map the dict to a database table. So, if you are looking for something that saves large dictionaries of data to disk or to a database, and can utilize pickling and encoding (codecs and hashmaps), then you might want to look at klepto.
klepto provides a dictionary abstraction for writing to a database, including treating your filesystem as a database (i.e. writing the entire dictionary to a single file, or writing each entry to its own file). For large data, I often choose to represent the dictionary as a directory on my filesystem, and have each entry be a file. klepto also offers caching algorithms, so if you are using a filesystem backend for the dictionary you can avoid some speed penalty by utilizing memory caching.
klepto also has other flags such as compression and memmode that can be used to customize how your data is stored (e.g. compression level, memory map mode, etc). It's equally easy (the same exact interface) to use a (MySQL, etc) database as a backend instead of your filesystem. You can also turn off memory caching, so every read/write goes directly to the archive, simply by setting cached=False.
klepto provides access to customizing your encoding, by building a custom keymap.
klepto also provides a lot of caching algorithms (like mru, lru, lfu, etc), to help you manage your in-memory cache, and will use the algorithm to do the dump and load to the archive backend for you.
You can use the flag cached=False to turn off memory caching completely, and directly read and write to and from disk or database. If your entries are large enough, you might pick to write to disk, where you put each entry in its own file. Here's an example that does both.
However, while this should greatly reduce load time, it might slow overall execution down a bit… it's usually better to specify the maximum amount to hold in memory cache and pick a good caching algorithm. You have to play with it to get the right balance for your needs.
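To illustrate the general idea of one file per entry plus a bounded in-memory cache, here is a standard-library sketch. It is not klepto and does not use klepto's API; the class DirIndex and its cache_size parameter are hypothetical, with functools.lru_cache standing in for the caching algorithms mentioned above.

```python
import hashlib
import os
import pickle
import tempfile
from functools import lru_cache


class DirIndex:
    """Hypothetical helper: one pickle file per word, with an LRU cache."""

    def __init__(self, directory, cache_size=1024):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)
        # Wrap the disk read so recently used entries stay in memory.
        # cache_size=0 disables caching, so every read goes to disk.
        self._cached_read = lru_cache(maxsize=cache_size)(self._read)

    def _path(self, word):
        # Hash the word so any string maps to a safe file name.
        name = hashlib.sha1(word.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, name + ".pkl")

    def _read(self, word):
        try:
            with open(self._path(word), "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            raise KeyError(word) from None

    def __setitem__(self, word, postings):
        with open(self._path(word), "wb") as f:
            pickle.dump(postings, f, pickle.HIGHEST_PROTOCOL)
        # Crude invalidation: drop the whole cache after any write.
        self._cached_read.cache_clear()

    def __getitem__(self, word):
        # Returns the cached object itself; callers should not mutate it.
        return self._cached_read(word)


store = DirIndex(tempfile.mkdtemp(), cache_size=2)
store["spam"] = {"doc1.txt": [102, 300, 399], "doc5.txt": [200, 587]}
first = store["spam"]   # read from disk
second = store["spam"]  # served from the in-memory cache
cache_info = store._cached_read.cache_info()
```

Startup cost is close to zero because nothing is loaded until a word is queried; the cache size is the knob for trading memory against repeated disk reads.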
Get klepto here: https://github.com/uqfoundation