Hadoop - how is the total number of mappers determined?

Posted 2019-07-29 11:10

I am new to Hadoop and just installed Oracle's VirtualBox and the Hortonworks Sandbox. I then downloaded the latest version of Hadoop and imported the jar files into my Java program. I copied a sample WordCount program and created a new jar file. I ran this jar file as a job using the Sandbox. The WordCount works perfectly fine, as expected. However, on the job status page, I see that the number of mappers for my input file is 28. My input file contains the following line.

Ramesh is studying at XXXXXXXXXX XX XXXXX XX XXXXXXXXX.

How was the total number of mappers determined to be 28?

I added the line below to my wordcount.java program to check.

FileInputFormat.setMaxInputSplitSize(job, 2);

Also, I would like to know whether the input file can contain only 2 rows. For example, suppose I have an input file like the one below.

row1,row2,row3,row4,row5,row6.......row20

Should I split the input file into 20 different files each having only 2 rows?

2 Answers
劳资没心,怎么记你
#2 · 2019-07-29 11:35

HDFS blocks and MapReduce splits are two different things. Blocks are a physical division of the data, while a split is just a logical division made during an MR job. It is the duty of the InputFormat to create the splits from a given set of data, and the number of mappers is decided based on the number of splits. When you use setMaxInputSplitSize, you override this behaviour and supply a split size of your own. But giving a very small value to setMaxInputSplitSize is overkill, as there will be a lot of very small splits, and you'll end up with a lot of unnecessary map tasks.
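
To make that concrete, here is a minimal sketch (not Hadoop's actual source, just the well-known FileInputFormat rule splitSize = max(minSplitSize, min(maxSplitSize, blockSize))); the 56-byte file length and 128 MB block size are assumptions for illustration only:

    public class SplitMath {

        // Mirrors the well-known FileInputFormat rule:
        // splitSize = max(minSize, min(maxSize, blockSize))
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024; // assumed HDFS block size (128 MB)
            long minSize   = 1;                  // default minimum split size
            long maxSize   = 2;                  // what setMaxInputSplitSize(job, 2) asks for
            long fileLen   = 56;                 // assumed size of the one-line input, in bytes

            long splitSize = computeSplitSize(blockSize, minSize, maxSize); // 2 bytes
            long numSplits = (fileLen + splitSize - 1) / splitSize;         // about 28
            System.out.println(splitSize + "-byte splits -> about " + numSplits + " map tasks");
        }
    }

In other words, if the 2-byte maximum split size was in effect when the job ran, a roughly 56-byte file would be chopped into roughly 28 tiny splits, which would line up with the 28 mappers you observed.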

Actually, I don't see any need for you to use FileInputFormat.setMaxInputSplitSize(job, 2); in your WordCount program. Also, it looks like you have misunderstood the 2 here. It is not the number of lines in a file; it is the split size, in bytes (a long), that you would like for your MR job. You can have any number of lines in the file that you use as your MR input.
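
For reference, here is a sketch of a driver based on the standard Hadoop WordCount example (class names and paths are placeholders for whatever your own program uses), with the split-size call left commented out to show that its argument is a byte count:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Emits (word, 1) for every token in an input line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Sums the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Only if you really want to cap the split size: the value is in BYTES,
            // not lines. 64 MB is shown purely as an example.
            // FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }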

Does this sound OK?

贪生不怕死
#3 · 2019-07-29 11:47

That means your input file is split into roughly 28 parts (blocks) in HDFS, since you said 28 map tasks were scheduled. However, that does not necessarily mean 28 parallel map tasks; the parallelism will depend on the number of slots you have in your cluster. I'm speaking in terms of Apache Hadoop; I don't know whether Hortonworks made any modifications to this.
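
If you want to check how many HDFS blocks the file actually occupies (as opposed to how many splits the job created), a small sketch along these lines, using the standard FileSystem API, prints the block count and block size; the input path argument is whatever you pass to your job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInfo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path input = new Path(args[0]);   // the file you feed to the job
            FileStatus status = fs.getFileStatus(input);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

            System.out.println("file length : " + status.getLen() + " bytes");
            System.out.println("block size  : " + status.getBlockSize() + " bytes");
            System.out.println("block count : " + blocks.length);
        }
    }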

Hadoop likes to work with large files, so do you really want to split your input file into 20 different files?
