How to increase hive concurrent mappers to more than 4

Published 2019-04-12 22:50

Question:

Summary

When I run a simple select count(*) query in Hive, only two nodes in my large cluster are used for mapping. I would like to use the whole cluster.

Details

I am using a somewhat large cluster (tens of nodes, each with more than 200 GB of RAM) running HDFS and Hive 1.2.1 (IBM-12).

I have a table of several billion rows. When I perform a simple

select count(*) from mytable;

Hive creates hundreds of map tasks, but only 4 run simultaneously.

This means that my cluster is mostly idle during the query, which seems wasteful. I have tried SSHing to the nodes in use, and they are not fully utilizing CPU or memory. Our cluster is backed by InfiniBand networking and Isilon file storage, neither of which appears to be heavily loaded.

We are using MapReduce as the execution engine. I have tried removing every resource limit I could find, but it does not change the fact that only two nodes are being used (4 concurrent mappers).

The memory settings are as follows:

yarn.nodemanager.resource.memory-mb     188928  MB
yarn.scheduler.minimum-allocation-mb    20992   MB
yarn.scheduler.maximum-allocation-mb    188928  MB
yarn.app.mapreduce.am.resource.mb       20992   MB
mapreduce.map.memory.mb                 20992   MB
mapreduce.reduce.memory.mb              20992   MB

and we are running on 41 nodes. By my calculation I should be able to run 41 * 188928 / 20992 = 369 map/reduce tasks concurrently. Instead I get 4.
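The container arithmetic in that calculation can be sketched as follows (values taken from the settings listed above; YARN also rounds each request up to a multiple of the minimum allocation, which here makes no difference since the request already equals it):

```python
# Sanity-check the expected container count from the memory settings above.
node_memory_mb = 188928      # yarn.nodemanager.resource.memory-mb
container_memory_mb = 20992  # mapreduce.map.memory.mb (= minimum allocation)
nodes = 41

containers_per_node = node_memory_mb // container_memory_mb
total_containers = nodes * containers_per_node

print(containers_per_node)   # 9 containers fit on each node
print(total_containers)      # 369 containers cluster-wide
```

So, purely from the memory settings, roughly 369 containers should fit (one of which a job's ApplicationMaster would occupy), which is why getting only 4 suggests the limit lies elsewhere.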

Vcore settings:

yarn.nodemanager.resource.cpu-vcores       24
yarn.scheduler.minimum-allocation-vcores   1
yarn.scheduler.maximum-allocation-vcores   24
yarn.app.mapreduce.am.resource.cpu-vcores  1
mapreduce.map.cpu.vcores                   1
mapreduce.reduce.cpu.vcores                1
  • Is there a way to get Hive/MapReduce to use more of my cluster?
  • How would I go about figuring out the bottleneck?
  • Could it be that YARN is not assigning tasks fast enough?

I expect that using Tez would improve performance, but I am still interested in why resource utilization is so limited (and we do not have Tez installed at the moment).

Answer 1:

The number of parallel tasks depends on your YARN memory settings. For example, if you have 4 data nodes and your YARN memory properties are defined as below:

yarn.nodemanager.resource.memory-mb 1 GB
yarn.scheduler.minimum-allocation-mb    1 GB
yarn.scheduler.maximum-allocation-mb    1 GB
yarn.app.mapreduce.am.resource.mb   1 GB
mapreduce.map.memory.mb 1 GB
mapreduce.reduce.memory.mb  1 GB

According to these settings, with 4 data nodes the cluster-wide total of yarn.nodemanager.resource.memory-mb is 4 GB, which is what YARN can use to launch containers. Since each container takes 1 GB of memory, at any given point in time you can launch at most 4 containers. One of them is used by the ApplicationMaster, so at most 3 mapper or reducer tasks can run at any given time, because the ApplicationMaster, each mapper, and each reducer all use 1 GB of memory.
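The 4-node example works out like this (a minimal sketch of the arithmetic, using the 1 GB figures above):

```python
# Worked version of the 4-node example: how many tasks can run at once?
nodes = 4
node_memory_gb = 1    # yarn.nodemanager.resource.memory-mb per node
container_gb = 1      # memory requested per container (AM, map, reduce alike)

total_containers = nodes * (node_memory_gb // container_gb)
am_containers = 1     # one ApplicationMaster per running job
concurrent_tasks = total_containers - am_containers

print(total_containers)   # 4 containers total
print(concurrent_tasks)   # 3 map/reduce tasks can run concurrently
```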

So you need to increase yarn.nodemanager.resource.memory-mb (and/or lower the per-container memory requests) to increase the number of concurrent map/reduce tasks.
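On Hadoop clusters this property is typically raised in yarn-site.xml. A sketch, with purely illustrative values (size them to your hardware; the maximum allocation usually moves up with it):

```xml
<!-- yarn-site.xml: illustrative values only, not a recommendation -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value> <!-- total memory YARN may hand out on each node -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value> <!-- largest single container YARN will grant -->
</property>
```

After changing these, the NodeManagers must be restarted for the new capacity to take effect.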

P.S. - We are talking here about the maximum number of tasks that can be launched; the actual number may be lower.