Fail to Increase Hive Mapper Tasks?

Posted 2019-02-20 07:45

I have a managed Hive table that contains only one 150MB file. When I run "select count(*) from tbl" against it, it uses 2 mappers. I want to raise that to a bigger number.

First I tried 'set mapred.max.split.size=8388608;', hoping it would use 19 mappers, but it only used 3. Somehow it still split the input into 64MB chunks. I also tried 'set dfs.block.size=8388608;', which didn't work either.
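As a sanity check on the expected numbers, the mapper count under a given max split size is roughly the ceiling of file size over split size (a simplified model that ignores Hadoop's slop factor for the last split):

```python
import math

def expected_mappers(file_size_bytes, max_split_bytes):
    # Each input split covers at most max_split_bytes, so the
    # mapper count is the ceiling of the size ratio.
    return math.ceil(file_size_bytes / max_split_bytes)

# 150MB file with an 8MB max split -> 19 mappers
print(expected_mappers(150 * 1024 * 1024, 8388608))
# 150MB file with a 64MB split -> 3 mappers, matching what Hive did
print(expected_mappers(150 * 1024 * 1024, 64 * 1024 * 1024))
```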

Then I tried a vanilla MapReduce job doing the same thing. It initially used 3 mappers, and when I set mapred.max.split.size it used 19. So I suppose the problem lies in Hive.

I read some of the Hive source code, like CombineHiveInputFormat and ExecDriver, but couldn't find a clue.

What other settings can I use?

Tags: hadoop hive

2 Answers
女痞 · answered 2019-02-20 07:51

I combined @javadba's answer with the one I received from the Hive mailing list; here's the solution:

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapred.map.tasks = 20;
select count(*) from dw_stage.st_dw_marketing_touch_pi_metrics_basic;

From the mailing list:

It seems that Hive is using the old Hadoop MapReduce API, so mapred.max.split.size won't work.

I'll dig into the source code later.
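For reference, the mapred.* names belong to the old Hadoop API and were renamed in the new (Hadoop 2.x+) API, so it can be worth setting both forms; a sketch:

```sql
-- Old-API name, as used in the question:
set mapred.max.split.size=8388608;
-- New-API equivalent of the same property:
set mapreduce.input.fileinputformat.split.maxsize=8388608;
```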

Melony? · answered 2019-02-20 07:52

Try adding the following:

set hive.merge.mapfiles=false;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
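These settings help because CombineHiveInputFormat, Hive's default input format, packs small splits together, which is why split-size settings alone had no visible effect; switching to plain HiveInputFormat lets them through. A combined sketch (tbl stands in for the real table name):

```sql
-- Stop Hive from combining splits and merging map output files,
-- then shrink the max split size to get more mappers:
set hive.merge.mapfiles=false;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapred.max.split.size=8388608;
select count(*) from tbl;
```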