Tips to improve MapReduce job performance in Hadoop

Posted 2019-01-27 09:30

I have 100 mappers and 1 reducer running in a job. How can I improve the job's performance?

As I understand it, using a combiner can improve performance to a great extent. But what else do we need to configure to improve job performance?

1 Answer

对你真心纯属浪费 · answered 2019-01-27 09:55

With the limited data in this question (input file size, HDFS block size, average map processing time, number of mapper and reducer slots in the cluster, etc.), we can't suggest specific tips.

But there are some general guidelines to improve the performance.

  1. If each task takes less than 30-40 seconds, reduce the number of tasks.
  2. If a job has more than 1 TB of input, consider increasing the block size of the input dataset to 256 MB or even 512 MB so that the number of tasks is smaller.
  3. As long as each task runs for at least 30-40 seconds, increase the number of map tasks to some multiple of the number of mapper slots in the cluster.
  4. The number of reduce tasks per job should be equal to or a bit less than the number of reduce slots in the cluster.
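The task-count tips above can be applied at job submission time. A minimal sketch, assuming Hadoop 2.x property names and that the job's main class parses generic options (e.g. via `ToolRunner`); `job.jar`, `MyJob`, and the input/output paths are placeholders:

```shell
# 256 MB minimum split size -> fewer, longer-running map tasks,
# and reducers set to (a bit less than) the cluster's reduce slots.
# Property names are standard Hadoop 2.x configuration keys.
hadoop jar job.jar MyJob \
  -D mapreduce.input.fileinputformat.split.minsize=268435456 \
  -D mapreduce.job.reduces=8 \
  input_dir output_dir
```

The same properties can equally be set in `mapred-site.xml` or on the `Job` object in the driver code.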

Some more tips:

  1. Configure the cluster properly, with the right diagnostic tools.
  2. Use compression when writing intermediate data to disk.
  3. Tune the number of map and reduce tasks as per the tips above.
  4. Incorporate a combiner wherever it is appropriate.
  5. Use the most appropriate data types for the output (do not use LongWritable when the output values fit in the Integer range; IntWritable is the right choice in that case).
  6. Reuse Writables instead of allocating new objects for every record.
  7. Use the right profiling tools.
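Tip 2 (compressing intermediate data) is a configuration change, not a code change. A sketch for `mapred-site.xml`, assuming Hadoop 2.x property names and that the Snappy codec is available on the cluster:

```xml
<!-- Compress map output before it is spilled to disk and shuffled. -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<!-- Snappy trades a little compression ratio for low CPU cost,
     which usually suits intermediate data well. -->
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

These can also be set per job with `-D` options or on the job's `Configuration`.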

Have a look at this Cloudera article for some more tips.
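To see why a combiner helps (the question's own starting point), here is a minimal pure-Python simulation of a word-count job. The function names (`map_phase`, `combine`) are illustrative only, not part of the Hadoop API; the point is that the combiner pre-aggregates on the map side, so fewer records cross the network in the shuffle:

```python
from collections import Counter

def map_phase(lines):
    """Emit one (word, 1) pair per token, as a word-count mapper would."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def combine(pairs):
    """Combiner: pre-aggregate pairs on the map side before the shuffle."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

lines = ["the quick brown fox", "the lazy dog", "the end"]
mapped = list(map_phase(lines))
combined = combine(mapped)

# The combiner shrinks the number of records sent over the network:
print(len(mapped))    # 9 records without a combiner
print(len(combined))  # 7 records after merging duplicate keys
```

With 100 mappers and 1 reducer, as in the question, this reduction is multiplied across every map task, which is exactly why a combiner helps so much when the reduce side is the bottleneck.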
