HyperTable: Loading data using Mutators Vs. LOAD D

I am starting this discussion in the hope that it becomes one place to compare two data-loading methods: loading with mutators versus loading a flat file with 'LOAD DATA INFILE'.

I have been baffled by my failure to get any real performance gain out of mutators, whatever the batch size (1,000, 10,000, 100K, et cetera).

My project involves loading close to 400 million rows of social media data into HyperTable to be used for real-time analytics. It took me close to 3 days to load just 1 million rows of data (code sample below). Each row is approximately 32 bytes. So, to avoid taking 2-3 weeks to load this much data, I prepared a flat file of rows and used the LOAD DATA INFILE method. The performance gain was amazing: with this method, the loading rate was 368336 cells/sec.

Here is the actual output from the two loads:

hypertable> LOAD DATA INFILE "/data/tmp/users.dat" INTO TABLE users;


Loading 7,113,154,337 bytes of input data...                    

0%   10   20   30   40   50   60   70   80   90   100%          
|----|----|----|----|----|----|----|----|----|----|             
***************************************************             
Load complete.                                                  

 Elapsed time:  508.07 s                                       
 Avg key size:  8.92 bytes                                     
  Total cells:  218976067                                      
   Throughput:  430998.80 cells/s                              
      Resends:  2210404                                        


hypertable> LOAD DATA INFILE "/data/tmp/graph.dat" INTO TABLE graph;

Loading 12,693,476,187 bytes of input data...                    

0%   10   20   30   40   50   60   70   80   90   100%           
|----|----|----|----|----|----|----|----|----|----|
***************************************************              
Load complete.                                                   

 Elapsed time:  1189.71 s                                       
 Avg key size:  17.48 bytes                                     
  Total cells:  437952134                                       
   Throughput:  368118.13 cells/s                               
      Resends:  1483209

Why is the performance difference between the two methods so vast? What is the best way to improve mutator performance? Sample mutator code is below:

my $batch_size = 1000000; # or 1000 or 10000 make no substantial difference
my $ignore_unknown_cfs = 2;
my $ht = new Hypertable::ThriftClient($master, $port);
my $ns = $ht->namespace_open($namespace);
my $users_mutator = $ht->mutator_open($ns, 'users', $ignore_unknown_cfs, 10);
my $graph_mutator = $ht->mutator_open($ns, 'graph', $ignore_unknown_cfs, 10);
my $keys = new Hypertable::ThriftGen::Key({ row => $row, column_family => $cf, column_qualifier => $cq });
my $cell = new Hypertable::ThriftGen::Cell({key => $keys, value => $val});
$ht->mutator_set_cell($mutator, $cell);
$ht->mutator_flush($mutator);

I would appreciate any input on this. I don't have a tremendous amount of HyperTable experience.

Thanks.

If it's taking three days to load one million rows, then you're probably calling flush() after every row insert, which is not the right thing to do. Before I describe how to fix that, note that your mutator_open() arguments aren't quite right. You don't need to specify ignore_unknown_cfs and you should supply 0 for the flush_interval, something like this:

my $users_mutator = $ht->mutator_open($ns, 'users', 0, 0);
my $graph_mutator = $ht->mutator_open($ns, 'graph', 0, 0);

You should only call mutator_flush() if you would like to checkpoint how much of the input data has been consumed. A successful call to mutator_flush() means that all data that has been inserted on that mutator has durably made it into the database. If you're not checkpointing how much of the input data has been consumed, then there is no need to call mutator_flush(), since it will get flushed automatically when you close the mutator.
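
If you do want checkpointing, a rough sketch of the idea looks like the following. It is not tested against a real cluster: StubClient is a made-up stand-in for Hypertable::ThriftClient so the snippet runs on its own, load_with_checkpoints is a hypothetical helper name, and the records are plain hashes rather than real Cell objects. I'm also assuming the Perl binding takes an array reference for the cell list.

```perl
use strict;
use warnings;

# Hypothetical stub in place of Hypertable::ThriftClient. It only tracks
# how many cells are pending versus flushed, so the sketch runs stand-alone.
{
    package StubClient;
    sub new {
        my $class = shift;
        return bless({ pending => 0, durable => 0 }, $class);
    }
    sub mutator_set_cells {
        my ($self, $mutator, $cells) = @_;
        $self->{pending} += scalar @$cells;
    }
    sub mutator_flush {
        my ($self, $mutator) = @_;
        $self->{durable} += $self->{pending};
        $self->{pending} = 0;
    }
    sub mutator_close {
        my ($self, $mutator) = @_;
        $self->mutator_flush($mutator);    # closing flushes whatever is left
    }
}

# Insert records in chunks, flush after each chunk, and remember how much
# of the input has been consumed at each successful flush.
sub load_with_checkpoints {
    my ($ht, $mutator, $records, $checkpoint_every) = @_;
    my @checkpoints;
    my $consumed = 0;
    while ($consumed < @$records) {
        my $last = $consumed + $checkpoint_every - 1;
        $last = $#$records if $last > $#$records;
        $ht->mutator_set_cells($mutator, [ @{$records}[$consumed .. $last] ]);
        $ht->mutator_flush($mutator);      # success means the chunk is durable
        $consumed = $last + 1;
        push @checkpoints, $consumed;      # safe restart offset in the input
    }
    $ht->mutator_close($mutator);
    return @checkpoints;
}

my $ht = StubClient->new;
my @records = map { +{ row => "r$_" } } 1 .. 25;
my @checkpoints = load_with_checkpoints($ht, 'mutator-handle', \@records, 10);
print "checkpoints: @checkpoints\n";       # checkpoints: 10 20 25
```

In a real loader you would persist each checkpoint offset somewhere (a file, for instance) so that a restarted job can skip the input that has already been flushed. Keep the checkpoint interval large; flushing after every row is what makes the load crawl.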

The next performance problem with your code that I see is that you're using mutator_set_cell(). You should use either mutator_set_cells() or mutator_set_cells_as_arrays() since each method call is a round-trip to the ThriftBroker, which is expensive. By using the mutator_set_cells_* methods, you amortize that round-trip over many cells. The mutator_set_cells_as_arrays() method can be more efficient for languages where object construction overhead is large in comparison to native datatypes (e.g. string). I'm not sure about Perl, but you might want to give that a try to see if it boosts performance.
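
To illustrate the batching, here is a minimal sketch. Again, StubClient is an invented stand-in that just counts calls so the snippet runs without a cluster, load_in_batches is a hypothetical helper, and the cells are plain hashes. With the real client you would build Hypertable::ThriftGen::Cell objects as in your code and pass the handle returned by mutator_open(). I'm assuming the Perl binding accepts an array reference for the list of cells.

```perl
use strict;
use warnings;

# Hypothetical stub in place of Hypertable::ThriftClient. It counts broker
# round-trips and cells received; replace it with the real client.
{
    package StubClient;
    sub new {
        my $class = shift;
        return bless({ calls => 0, cells => 0 }, $class);
    }
    sub mutator_set_cells {
        my ($self, $mutator, $cells) = @_;
        $self->{calls}++;
        $self->{cells} += scalar @$cells;
    }
    sub mutator_close { }
}

# Buffer cells and hand them to the broker $batch_size at a time, so one
# round-trip is shared by many cells.
sub load_in_batches {
    my ($ht, $mutator, $cells, $batch_size) = @_;
    my @buffer;
    for my $cell (@$cells) {
        push @buffer, $cell;
        if (@buffer >= $batch_size) {
            $ht->mutator_set_cells($mutator, [@buffer]);    # one call per batch
            @buffer = ();
        }
    }
    $ht->mutator_set_cells($mutator, [@buffer]) if @buffer; # send the remainder
    $ht->mutator_close($mutator);                           # flushes on close
}

my $ht = StubClient->new;
my @cells = map { +{ row => "user$_", value => $_ } } 1 .. 2500;
load_in_batches($ht, 'mutator-handle', \@cells, 1000);
printf "%d cells in %d calls\n", $ht->{cells}, $ht->{calls};  # 2500 cells in 3 calls
```

Note there is no mutator_flush() anywhere in the loop. With mutator_set_cell() the same 2500 cells would cost 2500 round-trips instead of 3.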

Also, be sure to call mutator_close() when you're finished with the mutator.