Hadoop: difference between 0 reducers and identity reducer

Posted 2019-01-08 14:09

I am just trying to confirm my understanding of the difference between 0 reducers and the identity reducer.

  • 0 reducers means the reduce step is skipped and the mapper output is the final output.
  • Identity reducer means shuffling/sorting will still take place?

4 answers
虎瘦雄心在
#2 · 2019-01-08 14:18

Another use case for the identity reducer is to combine all the results into <# of reducers> output files. This can be handy if you are using Amazon Web Services to write to S3 directly, especially if the mapper output is small (e.g. a grep/search for a record) and you have a lot of mappers (e.g. thousands).
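
To make the file-count point concrete, here is a rough plain-Python simulation (not Hadoop code). The function names, the 1000-mapper setup, and the byte-sum partitioner are all made up for illustration; a real job would use Hadoop's own partitioner and output format.

```python
from collections import defaultdict

def map_only_files(mapper_outputs):
    """Zero reducers: each mapper writes its own output file, even an empty one."""
    return [list(records) for records in mapper_outputs]

def identity_reduce_files(mapper_outputs, num_reducers):
    """Identity reducer: records are partitioned by key, sorted, and written unchanged."""
    partitions = defaultdict(list)
    for records in mapper_outputs:
        for key, value in records:
            # Hypothetical deterministic partitioner, used only for this sketch.
            partitions[sum(key.encode()) % num_reducers].append((key, value))
    return [sorted(partitions[i]) for i in range(num_reducers)]

# 1000 hypothetical grep-style mappers; only every 250th one finds a match.
mapper_outputs = [[(f"match-{i}", "line")] if i % 250 == 0 else []
                  for i in range(1000)]

print(len(map_only_files(mapper_outputs)))            # 1000 (mostly empty) files
print(len(identity_reduce_files(mapper_outputs, 2)))  # 2 files
```

The records themselves are identical in both cases; only the number of output files changes.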

Ridiculous、
#3 · 2019-01-08 14:27

It depends on your business requirements. If you are doing a word count, you should reduce your map output to get a total result. If you just want to change the words to upper case, you don't need a reduce step.

The star\"
#4 · 2019-01-08 14:32

Your understanding is correct. I would define it as follows: if you do not need the map results sorted, you set 0 reducers, and the job is called map-only.
If you need the map results sorted but do not need any aggregation, you choose the identity reducer.
And to complete the picture, there is a third case: if you do need aggregation, you need a real reducer.
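
The three cases can be sketched in plain Python as a single-process stand-in for a MapReduce job. This is not the Hadoop API: run_job and the mapper/reducer names are hypothetical, and with several reducers a real job would only sort within each reducer's output.

```python
from itertools import groupby
from operator import itemgetter

def run_job(records, mapper, reducer=None):
    """Simulate a job: map, then (only if a reducer is given) sort by key and reduce."""
    mapped = [pair for record in records for pair in mapper(record)]
    if reducer is None:
        return mapped                    # map-only: no sort, mapper order is kept
    mapped.sort(key=itemgetter(0))       # stand-in for the shuffle/sort phase
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, [value for _, value in group]))
    return output

def word_mapper(line):
    return [(word, 1) for word in line.split()]

def identity_reducer(key, values):
    return [(key, value) for value in values]   # pass every pair through unchanged

def sum_reducer(key, values):
    return [(key, sum(values))]                 # aggregate per key

lines = ["b a", "a c"]
print(run_job(lines, word_mapper))                    # [('b', 1), ('a', 1), ('a', 1), ('c', 1)]
print(run_job(lines, word_mapper, identity_reducer))  # [('a', 1), ('a', 1), ('b', 1), ('c', 1)]
print(run_job(lines, word_mapper, sum_reducer))       # [('a', 2), ('b', 1), ('c', 1)]
```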

仙女界的扛把子
#5 · 2019-01-08 14:33

The main difference between "no reducer" (mapred.reduce.tasks=0) and the "standard reducer", which is IdentityReducer (mapred.reduce.tasks=1 etc.), is that with "no reducer" there is no partitioning and shuffling after the map stage. In that case you get the 'pure' output from your mappers without any further processing. This helps with development and debugging, but that is not its only use.
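
As a rough illustration of the partitioning step that gets skipped, the sketch below mimics in plain Python the formula used by Hadoop's default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks. One loud assumption: it hashes keys the way Java's String.hashCode does, whereas Hadoop's Text keys hash their bytes differently, so the actual partition numbers in a real job may differ. The function names are hypothetical.

```python
def java_string_hash(s):
    """Java String.hashCode for BMP characters, as a signed 32-bit value."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h & 0x80000000 else h

def partition(key, num_reduce_tasks):
    """HashPartitioner-style formula: clear the sign bit, then take the modulo."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks

def route(pairs, num_reduce_tasks):
    """With 0 reduce tasks nothing is partitioned; otherwise group pairs per reducer."""
    if num_reduce_tasks == 0:
        return [list(pairs)]             # 'pure' mapper output, untouched
    buckets = [[] for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        buckets[partition(key, num_reduce_tasks)].append((key, value))
    return buckets

pairs = [("b", 1), ("a", 1), ("ab", 1)]
print(route(pairs, 0))  # [[('b', 1), ('a', 1), ('ab', 1)]]
print(route(pairs, 2))  # [[('b', 1)], [('a', 1), ('ab', 1)]]
```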
