fast loading of large hash table in Perl

I have about 30 text files with the structure

wordleft1|wordright1
wordleft2|wordright2
wordleft3|wordright3
...

The total size of the files is about 1 GB with about 32 million lines of word combinations.

I tried a few approaches to load them as quickly as possible and store the combinations in a hash:

$hash{$wordleft} = $wordright

Opening the files one by one and reading them line by line takes about 42 seconds. I then store the hash with the Storable module:

store \%hash, $filename

Loading the data again

$hashref = retrieve $filename

reduces the time to about 28 seconds. I use a fast SSD drive and a fast CPU and have enough RAM to hold all the data (it takes about 7 GB).
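
A minimal sketch of the two steps described above (plain line-by-line parsing, then a Storable round trip), run on a tiny stand-in file. The sub name load_pairs, the temporary paths, and the sample data are assumptions for illustration, not the original code:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

# Read "left|right" lines from each file into one hash.
sub load_pairs {
    my @files = @_;
    my %hash;
    for my $path (@files) {
        open my $fh, '<', $path or die "Cannot open $path: $!";
        while (my $line = <$fh>) {
            chomp $line;
            # Split on the first '|' only.
            my ($left, $right) = split /\|/, $line, 2;
            next unless defined $right;    # skip blank or malformed lines
            $hash{$left} = $right;
        }
        close $fh;
    }
    return \%hash;
}

# Tiny demo input (hypothetical data) in a throwaway directory.
my $dir  = tempdir(CLEANUP => 1);
my $text = "$dir/words.txt";
open my $out, '>', $text or die "Cannot write $text: $!";
print {$out} "wordleft1|wordright1\nwordleft2|wordright2\n";
close $out;

my $hashref = load_pairs($text);

# Cache the parsed hash in Storable's binary format, then load it back.
my $cache = "$dir/words.storable";
store $hashref, $cache;
my $reloaded = retrieve $cache;

print scalar(keys %$reloaded), " pairs reloaded\n";
```

On real data, load_pairs would take the list of about 30 input files, and only the retrieve call would run on later starts.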

I'm looking for a faster way to load this data into RAM (I can't keep it there permanently, for a few reasons).

Tags: performance perl hash

2 Answers

冷血范

#2 · 2019-04-10 10:27

You could try Dan Bernstein's CDB file format through a tied hash, which requires only minimal code changes. You may need to install CDB_File. On my laptop, the cdb file opens very quickly and I can do about 200-250k lookups per second. Here is an example script to create, use, and benchmark a cdb:

test_cdb.pl

#!/usr/bin/env perl

use warnings;
use strict;

use Benchmark qw(:all) ;
use CDB_File 'create';
use Time::HiRes qw( gettimeofday tv_interval );

scalar @ARGV or die "usage: $0 number_of_keys seconds_to_benchmark\n";
my ($size)    = $ARGV[0] || 1000;
my ($seconds) = $ARGV[1] || 10;

my $t0;
tic();

# Create CDB
my ($file, %data);

%data = map { $_ => 'something' } (1..$size);
print "Created $size element hash in memory\n";
toc();

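$file = 'data.cdb';
# Records are written to the temp file "$file.$$", which is renamed to $file on success.
create %data, $file, "$file.$$" or die "create failed: $!\n";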
my $bytes = -s $file;
print "Created data.cdb [ $size keys and values, $bytes bytes]\n";
toc();

# Read from CDB
my $c = tie my %h, 'CDB_File', 'data.cdb' or die "tie failed: $!\n";
print "Opened data.cdb as a tied hash.\n";
toc();

timethese( -1 * $seconds, {
          # Keys run from 1 to $size, so shift the random index by one.
          'Pick Random Key'    => sub { 1 + int rand $size },
          'Fetch Random Value' => sub { $h{ 1 + int rand $size }; },
});

tic();
print "Fetching Every Value\n";
for (1..$size) {    # keys were created as 1..$size
    my $value = $h{ $_ };
}
toc();

sub tic {
    $t0 = [gettimeofday];    
}

sub toc {
    my $t1 = [gettimeofday];
    my $elapsed = tv_interval ( $t0, $t1);
    $t0 = $t1;
    print "==> took $elapsed seconds\n";
}

Output (1 million keys, tested over 10 seconds):

./test_cdb.pl 1000000 10

Created 1000000 element hash in memory
==> took 2.882813 seconds
Created data.cdb [ 1000000 keys and values, 38890944 bytes]
==> took 2.333624 seconds
Opened data.cdb as a tied hash.
==> took 0.00015 seconds
Benchmark: running Fetch Random Value, Pick Random Key for at least 10 CPU seconds...
Fetch Random Value: 10 wallclock secs (10.46 usr +  0.01 sys = 10.47 CPU) @ 236984.72/s (n=2481230)
Pick Random Key:  9 wallclock secs (10.11 usr +  0.02 sys = 10.13 CPU) @ 3117208.98/s (n=31577327)
Fetching Every Value
==> took 3.514183 seconds

Output (10 million keys, tested over 10 seconds):

./test_cdb.pl 10000000 10

Created 10000000 element hash in memory
==> took 44.72331 seconds
Created data.cdb [ 10000000 keys and values, 398890945 bytes] 
==> took 25.729652 seconds
Opened data.cdb as a tied hash.
==> took 0.000222 seconds
Benchmark: running Fetch Random Value, Pick Random Key for at least 10 CPU seconds...
Fetch Random Value: 14 wallclock secs ( 9.65 usr +  0.35 sys = 10.00 CPU) @ 209811.20/s (n=2098112)
Pick Random Key: 12 wallclock secs (10.40 usr +  0.02 sys = 10.42 CPU) @ 2865335.22/s (n=29856793)
Fetching Every Value
==> took 38.274356 seconds
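
To apply this to the files in the question, the cdb can probably be built by streaming the text files through CDB_File's new/insert/finish interface, so the full hash never has to sit in memory during the build. A minimal sketch, where the sub name build_cdb, the paths, and the sample data are assumptions for illustration:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use CDB_File;
use File::Temp qw(tempdir);

# Stream "left|right" lines from text files straight into a cdb file.
sub build_cdb {
    my ($cdb_path, @files) = @_;
    # Records go to a temp file that finish() renames to $cdb_path.
    my $cdb = CDB_File->new($cdb_path, "$cdb_path.$$")
        or die "Cannot create $cdb_path: $!";
    for my $path (@files) {
        open my $fh, '<', $path or die "Cannot open $path: $!";
        while (my $line = <$fh>) {
            chomp $line;
            my ($left, $right) = split /\|/, $line, 2;
            next unless defined $right;    # skip blank or malformed lines
            $cdb->insert($left, $right);
        }
        close $fh;
    }
    $cdb->finish or die "Cannot finish $cdb_path: $!";
    return $cdb_path;
}

# Tiny demo input (hypothetical data) in a throwaway directory.
my $dir  = tempdir(CLEANUP => 1);
my $text = "$dir/words.txt";
open my $out, '>', $text or die "Cannot write $text: $!";
print {$out} "wordleft1|wordright1\nwordleft2|wordright2\n";
close $out;

my $cdb_file = build_cdb("$dir/words.cdb", $text);

# Lookups then go through a read-only tied hash, as in the script above.
tie my %words, 'CDB_File', $cdb_file or die "tie failed: $!";
my $value = $words{wordleft1};
print "wordleft1 => $value\n";
```

One difference to keep in mind: a cdb stores every inserted record, so if a left word occurs more than once, a lookup through the tied hash returns the first value, whereas a plain Perl hash assignment keeps the last one.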


叼着烟拽天下

#3 · 2019-04-10 10:29

It sounds like you do have a good use case for an in-memory Perl hash.

For faster storing/retrieving, I would recommend Sereal (Sereal::Encoder/Sereal::Decoder). If your disk storage is slow, you may even want to enable Snappy compression.
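
A minimal sketch of what that could look like, assuming a Sereal release recent enough to accept the compress option (older releases used a snappy option instead, as far as I recall). The sub names save_hash and load_hash, the path, and the sample data are made up for illustration:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Sereal::Encoder;
use Sereal::Decoder;
use File::Temp qw(tempdir);

# Serialize a hash reference to a file with Snappy compression.
sub save_hash {
    my ($hashref, $path) = @_;
    my $encoder = Sereal::Encoder->new({
        compress => Sereal::Encoder::SRL_SNAPPY(),
    });
    open my $fh, '>:raw', $path or die "Cannot write $path: $!";
    print {$fh} $encoder->encode($hashref);
    close $fh or die "Cannot close $path: $!";
    return;
}

# Slurp the file and decode it back into a hash reference.
sub load_hash {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $blob = do { local $/; <$fh> };
    close $fh;
    return Sereal::Decoder->new->decode($blob);
}

# Tiny demo (hypothetical data) in a throwaway directory.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/words.srl";
save_hash({ wordleft1 => 'wordright1', wordleft2 => 'wordright2' }, $file);

my $hashref = load_hash($file);
print scalar(keys %$hashref), " pairs loaded\n";
```

Whether compression helps depends on the disk: on a fast SSD the uncompressed form may load faster, so it is worth timing both.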
