I have about 30 text files with the following structure:
wordleft1|wordright1
wordleft2|wordright2
wordleft3|wordright3
...
The total size of the files is about 1 GB with about 32 million lines of word combinations.
I tried a few approaches to load them as fast as possible and store the combinations in a hash:
$hash{$wordleft} = $wordright
Opening the files one by one and reading them line by line takes about 42 seconds. I then store the hash with the Storable module:
store \%hash, $filename
Loading the data again
$hashref = retrieve $filename
reduces the time to about 28 seconds. I use a fast SSD drive and a fast CPU and have enough RAM to hold all the data (it takes about 7 GB).
I'm looking for a faster way to load this data into RAM (I can't keep it there for a few reasons).
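For reference, the approach described above can be sketched roughly as follows. This is a minimal sketch, not the exact code used: the sample file, the cache file name, and the helper `load_text_files` are hypothetical, and it assumes each line contains exactly one `|` separator.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

# Hypothetical stand-in for the ~30 input files: one small sample file.
my $dir   = tempdir(CLEANUP => 1);
my $input = "$dir/words1.txt";
open my $out, '>', $input or die "Cannot write $input: $!";
print {$out} "wordleft$_|wordright$_\n" for 1 .. 3;
close $out or die "Cannot close $input: $!";

# Read every file line by line and split each line on the first '|'.
sub load_text_files {
    my @files = @_;
    my %hash;
    for my $file (@files) {
        open my $in, '<', $file or die "Cannot read $file: $!";
        while (my $line = <$in>) {
            chomp $line;
            my ($left, $right) = split /\|/, $line, 2;
            next unless defined $right;    # skip lines without a separator
            $hash{$left} = $right;
        }
        close $in;
    }
    return \%hash;
}

my $hash = load_text_files($input);

# Cache the parsed hash with Storable, then load it back.
my $cache = "$dir/words.storable";         # hypothetical cache file name
store $hash, $cache;
my $hashref = retrieve $cache;

print scalar(keys %$hashref), " keys, wordleft2 => $hashref->{wordleft2}\n";
```

The text parsing only has to happen once; later runs would call `retrieve` on the cache file alone.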
You could try Dan Bernstein's CDB file format through a tied hash, which requires minimal code changes. You may need to install CDB_File. On my laptop, the CDB file opens very quickly and I can do about 200-250k lookups per second. Here is an example script to create, use, and benchmark a CDB file:
test_cdb.pl
Output (1 million keys, tested over 10 seconds)
Output (10 million keys, tested over 10 seconds)
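As a much smaller illustration of the same idea (this is not the benchmark script named above), creating a CDB file and reading it through a tied hash might look like the sketch below. It assumes the non-core CDB_File module is installed from CPAN; the file path and sample keys are hypothetical.

```perl
use strict;
use warnings;
use CDB_File;                  # non-core module, assumed to be installed
use File::Temp qw(tempdir);

my $dir      = tempdir(CLEANUP => 1);
my $cdb_path = "$dir/words.cdb";    # hypothetical file name

# Build the CDB: records go to a temporary file, and finish()
# moves it into place under the final name.
my $maker = CDB_File->new($cdb_path, "$cdb_path.$$")
    or die "Cannot create $cdb_path: $!";
$maker->insert("wordleft$_", "wordright$_") for 1 .. 3;
$maker->finish;

# Tie a hash to the file. Lookups read from disk on demand,
# so nothing has to be loaded up front.
tie my %words, 'CDB_File', $cdb_path or die "tie failed: $!";

my $value = $words{wordleft2};
print "wordleft2 => $value\n";

untie %words;
```

The tied hash is read-only, so this fits data that is built once and then only queried.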
It sounds like you do have a good use case for an in-memory Perl hash.
For faster storing/retrieving, I would recommend Sereal (Sereal::Encoder/Sereal::Decoder). If your disk storage is slow, you may even want to enable Snappy compression.
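A minimal sketch of a Sereal round trip with Snappy compression enabled, assuming the non-core Sereal::Encoder and Sereal::Decoder modules are installed; the file name and sample data are hypothetical:

```perl
use strict;
use warnings;
use Sereal::Encoder;           # non-core modules, assumed to be installed
use Sereal::Decoder;
use File::Temp qw(tempdir);

# Hypothetical sample data standing in for the real 32 million pairs.
my %hash = map { ("wordleft$_" => "wordright$_") } 1 .. 3;

# Snappy compression is optional; drop the option if the disk is fast.
my $encoder = Sereal::Encoder->new(
    { compress => Sereal::Encoder::SRL_SNAPPY() }
);
my $decoder = Sereal::Decoder->new;

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/words.srl";   # hypothetical file name

# Store: serialize the hash reference and write the bytes in raw mode.
open my $out, '>:raw', $file or die "Cannot write $file: $!";
print {$out} $encoder->encode(\%hash);
close $out or die "Cannot close $file: $!";

# Retrieve: slurp the file and decode it back into a hash reference.
open my $in, '<:raw', $file or die "Cannot read $file: $!";
my $blob = do { local $/; <$in> };
close $in;
my $hashref = $decoder->decode($blob);

print "wordleft2 => $hashref->{wordleft2}\n";
```

Whether compression helps depends on whether disk reads or decoding dominate the load time, so it is worth timing both variants on the real data.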