I have a managed Hive table that contains a single 150 MB file. When I run "select count(*) from tbl" against it, Hive uses 2 mappers, and I want to raise that number.
First I tried 'set mapred.max.split.size=8388608;', hoping the query would then use 19 mappers (150 MB / 8 MB ≈ 19). But it uses only 3; somehow Hive still splits the input into roughly 64 MB chunks. I also tried 'set dfs.block.size=8388608;', which didn't work either.
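For reference, here is the whole attempt in the Hive CLI (the table name tbl is from the query above):

```
-- Shrink the max split size to 8 MB, expecting ~19 splits for a 150 MB file:
set mapred.max.split.size=8388608;
-- Also tried shrinking the block size, with no effect:
set dfs.block.size=8388608;
-- Still launches only 3 mappers:
select count(*) from tbl;
```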
Then I tried a vanilla MapReduce job that does the same thing. It initially uses 3 mappers, and when I set mapred.max.split.size it uses 19. So the problem seems to lie in Hive.
I read some of the Hive source code (CombineHiveInputFormat, ExecDriver, etc.) but couldn't find a clue.
What other settings can I use?
I combined @javadba's answer with what I received from the Hive mailing list; here's the solution:
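A minimal sketch of the combined fix, assuming the culprit is Hive's default CombineHiveInputFormat, which merges small splits so that mapred.max.split.size never takes effect; switching to plain HiveInputFormat makes the setting apply:

```
-- Sketch, assuming the default CombineHiveInputFormat is merging the splits.
-- Plain HiveInputFormat honors mapred.max.split.size directly.
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapred.max.split.size=8388608;  -- 8 MB splits => ~19 mappers for 150 MB
select count(*) from tbl;
```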
From the mailing list: Hive's default input format, CombineHiveInputFormat, combines small splits into larger ones, which is why mapred.max.split.size appeared to be ignored; using HiveInputFormat avoids the combining. I'll dig into the source code later.
Try adding the following:
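An illustrative sketch of the kind of settings meant here, not a verbatim list (all are standard MapReduce/Hive parameters):

```
-- Illustrative only: standard knobs that influence the mapper count.
set mapred.map.tasks=19;            -- desired mapper count (a hint, not a hard limit)
set mapred.min.split.size=1;        -- allow splits smaller than one HDFS block
set mapred.max.split.size=8388608;  -- cap each split at 8 MB
```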