Overriding TableMapper splits

I am using the following code to read from a table which has its row keys having a format of "epoch_meter" where epoch is the long representation of the date time in seconds and meter is a meter number.



Job jobCalcDFT = Job.getInstance(confCalcIndDeviation);

jobCalcDFT.setJarByClass(CalculateIndividualDeviation.class);

Scan scan = new Scan(Bytes.toBytes(String.valueOf(startSeconds) + "_"),
Bytes.toBytes(String.valueOf(endSeconds + 1) + "_"));

scan.setCaching(500);

scan.setCacheBlocks(false);

scan.addColumn(Bytes.toBytes("readings"), Bytes.toBytes("halfhourly"));

TableMapReduceUtil.initTableMapperJob("meterreadings", 
scan, EmitDFTMapper.class,
MeterIdFrequencyKey.class, 
ComplexWritable.class, jobCalcDFT);

I can mention start row and end row. But I am not being able to find much information on how to controls the splits.

So, the meterreadings table has 100 million rows. The value in each row is just 32 bytes (a float value). That would be around 3.2 GB (I am not considering the space for the keys - if I consider the keys then considering each key is a string value of around 20 to 30 characters - that will probably add 60 bytes per row). Now I am not sure now internally HBase will compress this, but if I do not consider compressing, that 3.2GB should be split up into quite a few mappers. Comparing with HDFS splitting, if I consider a 128MB split, it should give me around 25 TableMappers.

Now the startrow endrow combination I am using, is checking for around 1/25th of that 100 million recods. And, consequently, I am seeing only 2 TableMappers being used for this job. Don't know if that's how the calculation works, its a guess.

But it is still around 4 million rows, and two mappers is making the job run very slow. Can anyone tell me how can I change the splitting (now that TableInputFormat is deprecated) such that there are more TableMappers reading the rows.

Thanks Regards

You need to use your own TableMapReduceUtil (or copy HBase's) and use a subclass TableInputFormatBase and override getSplits to return more splits than 1 per region (the default behavior) - you can find the code for TableInputFormatBase here

By the way the TableInputFormat that is depracated is in org.apache.hadoop.hbase.mapred namespace not the org.apache.hadoop.hbase.mapreduce namespace