Change File Split size in Hadoop

I have a bunch of small files in an HDFS directory. Although the total volume of the files is relatively small, the processing time per file is huge: a 64 MB file (64 MB being the default split size for TextInputFormat) can take several hours to process.

What I need to do is reduce the split size so that I can use more nodes for a job.

So the question is: how can I split the files into chunks of, say, 10 KB? Do I need to implement my own InputFormat and RecordReader for this, or is there a parameter I can set? Thanks.

Tags: java hadoop mapreduce distributed-computing

5 answers

闹够了就滚

#2 · 2019-01-08 10:35

Here is a fragment that illustrates the correct way to do this without magic configuration strings. The constant needed is defined in FileInputFormat. The block size can be taken from the default HDFS block size constant if needed, but there is a good chance that it has been overridden by the user.

Here I simply halve the maximum split size, falling back to a default when it has not been set.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// ....

final long DEFAULT_SPLIT_SIZE = 128 * 1024 * 1024;
final Configuration conf = ...

// Halve the maximum split size; if it is not set, start from DEFAULT_SPLIT_SIZE.
conf.setLong(
    FileInputFormat.SPLIT_MAXSIZE,
    conf.getLong(
        FileInputFormat.SPLIT_MAXSIZE, DEFAULT_SPLIT_SIZE) / 2);

骚年 ilove

#3 · 2019-01-08 10:49

"Hadoop: The Definitive Guide", p. 202:

Given a set of files, how does FileInputFormat turn them into splits? FileInputFormat splits only large files. Here “large” means larger than an HDFS block. The split size is normally the size of an HDFS block.

So you would have to change the HDFS block size, but that is the wrong approach. Consider reviewing the architecture of your MapReduce application instead.


Explosion°爆炸

#4 · 2019-01-08 10:56

Write a custom input format that extends CombineFileInputFormat (it has its own pros and cons, depending on the Hadoop distribution). It combines the input splits up to the size specified in mapred.max.split.size.


啃猪蹄的小仙女

#5 · 2019-01-08 10:57

The parameter mapred.max.split.size, which can be set individually per job, is what you are looking for. Don't change dfs.block.size, because it is global for HDFS and changing it can lead to problems.


淡お忘

#6 · 2019-01-08 10:59

Hadoop: The Definitive Guide, page 203: "The maximum split size defaults to the maximum value that can be represented by a Java long type. It has an effect only when it is less than the block size, forcing splits to be smaller than a block. The split size is calculated by the formula:

max(minimumSize, min(maximumSize, blockSize))

by default

minimumSize < blockSize < maximumSize

so the split size is blockSize."

For example,

Minimum split size   1
Maximum split size   32 MB
Block size           64 MB
Split size           32 MB
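
The formula and the example values can be checked with a few lines of plain Java. This is only a sketch of the quoted formula, not Hadoop's own class: the name SplitSizeSketch and its method are illustrative, and the 64 MB block size is taken from the example rather than read from a cluster.

```java
// Sketch of the split-size formula quoted above; illustrative names, not Hadoop API.
class SplitSizeSketch {
    static final long MB = 1024L * 1024L;

    // max(minimumSize, min(maximumSize, blockSize))
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        // Defaults: minimum 1, maximum Long.MAX_VALUE, so the split size equals the block size.
        System.out.println(computeSplitSize(1, Long.MAX_VALUE, 64 * MB) / MB + " MB");
        // Example values: maximum 32 MB, block 64 MB.
        System.out.println(computeSplitSize(1, 32 * MB, 64 * MB) / MB + " MB");
        // The 10 KB split asked about in the question.
        System.out.println(computeSplitSize(1, 10 * 1024, 64 * MB) + " bytes");
    }
}
```

Under these assumptions, lowering only the maximum split size below the block size is enough to get smaller splits; the block size itself does not need to change.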

Hadoop works better with a small number of large files than with a large number of small files. One reason for this is that FileInputFormat generates splits in such a way that each split is all or part of a single file. If the files are very small ("small" means significantly smaller than an HDFS block) and there are a lot of them, then each map task will process very little input, and there will be a lot of map tasks (one per file), each of which imposes extra bookkeeping overhead. Compare a 1 GB file broken into sixteen 64 MB blocks with 10,000 or so 100 KB files. The 10,000 files use one map each, and the job can be tens or hundreds of times slower than the equivalent one with a single input file and 16 map tasks.
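
The task counts in this comparison can be reproduced with a rough calculation. The sketch below is an approximation for illustration, not Hadoop's implementation: the class and method names are hypothetical, it assumes non-empty files, and it uses plain ceiling division, ignoring the small slack Hadoop allows on the last split of a file.

```java
// Rough count of map tasks (one per split); illustrative names, not Hadoop API.
class SplitCountSketch {
    static final long KB = 1024L;
    static final long MB = 1024L * KB;
    static final long GB = 1024L * MB;

    // A split never spans two files, so each non-empty file yields at least one split.
    static long splitsForFile(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        // One 1 GB file with 64 MB splits: 16 map tasks.
        System.out.println(splitsForFile(1 * GB, 64 * MB));
        // 10,000 files of 100 KB each with 64 MB splits: one map task per file.
        System.out.println(10_000 * splitsForFile(100 * KB, 64 * MB));
        // One 64 MB file with the 10 KB split size from the question.
        System.out.println(splitsForFile(64 * MB, 10 * KB));
    }
}
```

The last line shows the other side of the trade-off: a 10 KB split size turns a single 64 MB file into several thousand map tasks, which makes sense only when the per-record processing cost dominates the per-task overhead, as in the question.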

