I'm thinking about using HBase as a source for one of my MapReduce jobs. I know that TableInputFormat specifies one input split (and thus one mapper) per Region. However, this seems inefficient. I'd really like to have multiple mappers working on a given Region at once. Can I achieve this by extending TableInputFormatBase? Can you please point me to an example? Furthermore, is this even a good idea?
Thanks for the help.
Not sure if you can specify multiple mappers for a given region, but consider the following:
If you think one mapper per region is inefficient (maybe your data nodes don't have enough resources, e.g. too few CPUs), you can specify a smaller region size in hbase-site.xml.
Here's a page listing the default configuration options, if you want to look into changing that: http://hbase.apache.org/configuration.html#hbase_default_configurations
Please note that by making the region size smaller, you will increase the number of files in your DFS, which can limit the capacity of your Hadoop DFS depending on the memory of your NameNode. Remember, the NameNode's memory usage is directly related to the number of files in your DFS. This may or may not be relevant to your situation, as I don't know how your cluster is being used. There is never a silver-bullet answer to these questions!
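For instance, lowering hbase.hregion.max.filesize makes regions split sooner, so a table ends up with more (smaller) regions and therefore more map tasks. A minimal sketch of the hbase-site.xml entry; the 1 GB value is purely illustrative, not a recommendation:

```xml
<!-- Cap regions at roughly 1 GB so tables split into more, smaller
     regions, and TableInputFormat produces more map tasks. -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>1073741824</value>
</property>
```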
1. It's absolutely fine; just make sure the key set is mutually exclusive between the mappers, i.e. each mapper scans a disjoint row-key range (see the sketch right below).
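One way to set that up with stock HBase (0.94.5 or later) is to hand TableMapReduceUtil a list of Scans over non-overlapping row-key ranges: every Scan that touches a region yields its own split, so two scans over different halves of a region give you two concurrent mappers on that region. A minimal sketch, where the table name my_table, the key boundaries, and MyMapper are all illustrative:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class DisjointScansJob {

  // Trivial mapper, just to make the sketch self-contained.
  static class MyMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      context.write(new Text(row.get()), new Text(String.valueOf(value.size())));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "disjoint-scans");
    job.setJarByClass(DisjointScansJob.class);

    // Two scans over non-overlapping row-key ranges of the same table.
    // Each scan yields its own split(s), so two mappers can read
    // different slices of the same region concurrently.
    List<Scan> scans = new ArrayList<Scan>();
    for (String[] range : new String[][] { { "a", "m" }, { "m", "z" } }) {
      Scan scan = new Scan();
      scan.setStartRow(Bytes.toBytes(range[0]));
      scan.setStopRow(Bytes.toBytes(range[1])); // stop row is exclusive, so ranges stay disjoint
      scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes("my_table"));
      scans.add(scan);
    }

    TableMapReduceUtil.initTableMapperJob(scans, MyMapper.class,
        Text.class, Text.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```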
Using this MultipleScanTableInputFormat, you can set the MultipleScanTableInputFormat.PARTITIONS_PER_REGION_SERVER configuration to control how many mappers should execute against a single regionserver. The class groups all of the input splits by their location (regionserver), and the RecordReader properly iterates through all of the aggregated splits for the mapper.
Here is an example:
https://gist.github.com/bbeaudreault/9788499#file-multiplescantableinputformat-java-L90
That way, you end up with multiple aggregated splits per mapper, partitioned by regionserver.
This is useful when you want to scan large regions (hundreds of millions of rows) with a conditioned scan that finds only a few records, because it helps prevent a ScannerTimeoutException.
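If you want to try it, the glue code might look roughly like the fragment below. This is a guess, assuming the gist class is on your job's classpath, that it reads the table name via the stock TableInputFormat.INPUT_TABLE key, and that PARTITIONS_PER_REGION_SERVER is a String configuration key taking an int; check the gist for the actual contract:

```java
// Inside your job driver (fragment, not a complete class).
Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "my_table");
// Ask for 4 mappers per regionserver (illustrative value).
conf.setInt(MultipleScanTableInputFormat.PARTITIONS_PER_REGION_SERVER, 4);

Job job = Job.getInstance(conf, "multiple-scan-table-input");
job.setInputFormatClass(MultipleScanTableInputFormat.class);
// ...set mapper/output classes as usual and submit.
```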
You need a custom input format that extends InputFormat. You can get an idea of how to do this from the answer to the question "I want to scan lots of data (range-based queries); what optimizations can I do while writing the data so that the scan becomes faster?". This is a good idea if the data-processing time is much greater than the data-retrieval time. A rough sketch follows.
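To show the mechanics of "more splits than regions", here is a sketch of my own (not taken from the linked answer): subclass the stock TableInputFormat and cut every per-region split at the midpoint of its row-key range. It assumes the older byte[]-based TableSplit API (0.94/0.96 era), and the midpoint math via Bytes.split is naive; open-ended first/last regions are left whole:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

// Sketch: produce two mappers per region by cutting every per-region
// split at the midpoint of its row-key range. The record reader honors
// the split's start/stop rows, so each mapper scans only its half.
public class SubdividedTableInputFormat extends TableInputFormat {

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    List<InputSplit> result = new ArrayList<InputSplit>();
    for (InputSplit split : super.getSplits(context)) {
      TableSplit ts = (TableSplit) split;
      byte[] start = ts.getStartRow();
      byte[] end = ts.getEndRow();
      // First/last regions have open-ended ranges; keep those whole.
      if (start.length == 0 || end.length == 0) {
        result.add(ts);
        continue;
      }
      // Bytes.split returns {start, mid, end} for one intermediate key;
      // this can fail for pathological key ranges, so treat it as a sketch.
      byte[] mid = Bytes.split(start, end, 1)[1];
      result.add(new TableSplit(ts.getTableName(), start, mid, ts.getRegionLocation()));
      result.add(new TableSplit(ts.getTableName(), mid, end, ts.getRegionLocation()));
    }
    return result;
  }
}
```

Note that both sub-splits keep the region's location hint, so the extra mappers still run close to the data; whether that wins anything depends on how CPU-bound your map logic is.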