This is a conceptual question involving Hadoop/HDFS. Let's say you have a file containing 1 billion lines. And for the sake of simplicity, let's assume that each line is of the form <k,v>, where k is the offset of the line from the beginning of the file and v is the content of the line.
Now, when we say that we want to run N map tasks, does the framework split the input file into N splits and run each map task on one of them? Or do we have to write a partitioning function that produces the N splits and then run each map task on the split it generated?
All I want to know is whether the splits are done internally or whether we have to split the data manually.
More specifically, each time the map() function is called, what are its Key key and Value val parameters?
Thanks, Deepak
When a Hadoop job is run, it splits the input files into chunks and assigns each split to a mapper to process; each such chunk is called an InputSplit.
FileInputFormat.addInputPath(job, new Path(args[0])); or
conf.setInputFormat(TextInputFormat.class);
The InputFormat configured here (via FileInputFormat.addInputPath and conf.setInputFormat) takes care of creating the InputSplits, and this also determines how many mappers get created. We can say that the number of InputSplits, and therefore the number of mappers, is proportional to the number of blocks used to store the input file on HDFS.
For example, if we have an input file of size 74 MB, it is stored on HDFS in two blocks (64 MB and 10 MB), so there are two InputSplits for this file and two mapper instances get created to read it.
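To make the key/value part of the question concrete, here is a minimal sketch (the class names LineDemo and LineMapper are illustrative, and it assumes the newer org.apache.hadoop.mapreduce API with TextInputFormat): the framework computes the splits itself and calls map() once per line, passing the byte offset of the line as the key and the line's content as the value.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineDemo {

    // With TextInputFormat, the framework calls map() once per line:
    //   key   = byte offset of the line from the start of the file (LongWritable)
    //   value = the content of the line (Text)
    public static class LineMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);  // simply echo <offset, line>
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "line-demo");
        job.setJarByClass(LineDemo.class);
        job.setMapperClass(LineMapper.class);
        job.setNumReduceTasks(0);                        // map-only job
        job.setInputFormatClass(TextInputFormat.class);  // splits are computed by the framework
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You never split the data manually: the InputFormat computes one split per block here, and the framework feeds each split's records to a mapper.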
Files are split into HDFS blocks and the blocks are replicated. Hadoop assigns a node for a split based on the data locality principle: Hadoop will try to execute the mapper on the nodes where the block resides. Because of replication, there are multiple such nodes hosting the same block.
In case those nodes are not available, Hadoop will try to pick a node that is closest to the node that hosts the data block; it could pick another node in the same rack, for example. A node may not be available for various reasons: all of its map slots may be in use, or the node may simply be down.
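As an illustrative (hypothetical, not from the original answer) snippet, a mapper can inspect its own split and see which hosts store the underlying block, which is the same locality information the scheduler uses:

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical mapper that only reports which split it was given and which
// hosts store the corresponding HDFS block replicas.
public class SplitInfoMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // With TextInputFormat the split handed to this task is a FileSplit.
        FileSplit split = (FileSplit) context.getInputSplit();
        String info = String.format("file=%s start=%d length=%d hosts=%s",
                split.getPath(),                         // file this split covers
                split.getStart(),                        // byte offset where the split begins
                split.getLength(),                       // split length in bytes
                Arrays.toString(split.getLocations()));  // nodes hosting the block
        context.write(new Text(info), NullWritable.get());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // Intentionally empty: only the split metadata is of interest here.
    }
}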
FileInputFormat is the abstract class which defines how the input files are read and split up. FileInputFormat provides the following functionality: 1. selects the files/objects that should be used as input; 2. defines the InputSplits that break a file into tasks.
As per Hadoop's basic functionality, if there are n splits then there will be n mappers.
HDFS splits the files into blocks when they are written; the InputFormat then computes the logical splits when the job is submitted. Use FileInputFormat for large files and CombineFileInputFormat for many smaller ones. You can also check whether an input can be split at all with the isSplitable method. Each split is then assigned to a map task, which preferably runs on a data node holding the corresponding block, for further analysis. The size of a split also depends on the value you have set in the mapred.max.split.size parameter.
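A short sketch of those knobs (assuming the newer mapreduce API, where mapred.max.split.size is called mapreduce.input.fileinputformat.split.maxsize; the class name SplitTuning is made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitTuning {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-tuning");

        // Cap each split at 32 MB: the 74 MB example file would then yield
        // three splits (and three mappers) instead of two, even though the
        // HDFS block size is still 64 MB.
        FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
        job.setInputFormatClass(TextInputFormat.class);

        // Alternative for many small files: pack several files into one split
        // so you do not get one mapper per tiny file.
        // job.setInputFormatClass(CombineTextInputFormat.class);
        // CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... set mapper, reducer, and output format/path as usual ...
    }
}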
The InputFormat is responsible for providing the splits. In general, if you have n nodes, HDFS will distribute the file over these n nodes. If you start a job, there will be one mapper per split by default (typically one per HDFS block). Thanks to Hadoop, the mapper on a machine will process the part of the data that is stored on that node; this is the data locality principle, supported by Hadoop's rack awareness. So to make a long story short: upload the data to HDFS and start an MR job. Hadoop will take care of the optimised execution.