How to set the data block size in Hadoop? Is it advantageous to change it?

Posted 2020-03-01 02:20

If the data block size in Hadoop can be changed, please let me know how to do that. Is it advantageous to change the block size? If yes, let me know why and how; if not, let me know why not.

3 Answers
闹够了就滚
#2 · 2020-03-01 02:48

You can change the block size at any time, unless the dfs.blocksize parameter is declared final in hdfs-site.xml.

To change the block size:

  1. When running a hadoop fs command, pass the block size with -D: hadoop fs -Ddfs.blocksize=67108864 -put <local_file> <hdfs_path>. This stores the file with a 64 MB block size.
  2. When running a hadoop jar command: hadoop jar <jar_file> <class> -Ddfs.blocksize=<desired_block_size> <other_args>. The reducers will use the specified block size when storing their output in HDFS.
  3. Inside a MapReduce program, you can set dfs.blocksize on the job configuration (see the sketch after this list).
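
Here is a minimal sketch of option 3, plus setting the block size per file through the FileSystem API. The class name, paths, and the 64 MB value are placeholders of mine, not part of the original answer:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Option A: override the client-side default; everything this client
            // writes (including job output) uses 64 MB blocks, just like the
            // -Ddfs.blocksize=67108864 example above.
            conf.set("dfs.blocksize", "67108864");

            FileSystem fs = FileSystem.get(conf);

            // Option B: set the block size explicitly for a single file.
            // Signature: create(path, overwrite, bufferSize, replication, blockSize)
            Path out = new Path("/tmp/blocksize-demo.txt"); // hypothetical path
            try (FSDataOutputStream stream =
                     fs.create(out, true, 4096, (short) 3, 64L * 1024 * 1024)) {
                stream.writeBytes("hello hdfs\n");
            }
        }
    }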

Criteria for changing block size:

  1. Typically, 128 MB works well for uncompressed files.
  2. Consider reducing the block size for compressed files. If the compression ratio is very high, a larger block size can slow down processing, and a non-splittable compression codec aggravates the issue.
  3. As long as the file size is larger than the block size, you need not change the block size. If the number of mappers needed to process the data is very high, you can reduce it by increasing the split size instead. For example, 1 TB of data with a 128 MB block size requires roughly 8000 mappers by default. Rather than changing the block size, you can raise the split size to 512 MB or even 1 GB, and far fewer mappers will be needed to process the data (see the sketch after this list).
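
As a rough illustration of item 3, a job driver can raise the minimum split size above the block size; the class name, job name, and the 512 MB value are placeholders of mine, not from the original answer:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-size-demo");

            // With a 512 MB minimum split size, each mapper reads four 128 MB blocks,
            // so ~1 TB of input needs roughly 2000 mappers instead of ~8000.
            FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);
            // Equivalent to -Dmapreduce.input.fileinputformat.split.minsize=536870912
        }
    }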

I have covered most of this in parts 2 and 3 of this performance tuning playlist.

Viruses.
#3 · 2020-03-01 02:49

There seems to be much confusion about this topic, and also some wrong advice going around. To clear up the confusion, it helps to think about how HDFS is actually implemented:

HDFS is an abstraction over distributed disk-based file systems, so the words "block" and "blocksize" have a different meaning here than is generally understood. For HDFS, a "file" is just a collection of blocks, and each "block" in turn is stored as an actual file on a datanode. In fact, each block is stored on several datanodes, according to the replication factor. The blocksize of these individual files, and their other performance characteristics, depend on the underlying filesystems of the individual datanodes.

The mapping between an HDFS file and the individual files on the datanodes is maintained by the namenode. But the namenode doesn't expect a specific blocksize; it just stores the mappings that were created when the HDFS file was written, which is usually split according to the default dfs.blocksize (but can be individually overridden).

This means, for example, that if you have a 1 MB file with a replication factor of 3 and a blocksize of 64 MB, you don't lose 63 MB * 3 = 189 MB, since physically just three 1 MB files are stored, using the standard blocksize of the underlying filesystems (e.g. ext4).
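
You can see this for yourself with the FileSystem API. The following is a small sketch of mine (the path comes from the command line): it prints the logical blocksize the namenode recorded for a file, plus the actual length and locations of each block. A 1 MB file shows a single block of length 1 MB, not 64 MB.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInspector {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path(args[0]));

            // The logical blocksize recorded for this file (e.g. 64 MB).
            System.out.println("Logical block size: " + status.getBlockSize());

            // Each block's real length and the datanodes holding its replicas.
            for (BlockLocation block :
                     fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }

The command hdfs fsck <path> -files -blocks -locations reports the same information from the command line.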

So the question becomes what a good dfs.blocksize is, and whether it's advisable to change it. Let me first list the aspects in favor of a bigger blocksize:

  1. Namenode pressure: As mentioned, the namenode has to maintain the mappings between HDFS files and their blocks, i.e. the physical files on datanodes. So the fewer blocks per file, the less memory pressure and communication overhead it has.
  2. Disk throughput: Files are written by a single process in Hadoop, which usually results in data being written sequentially to disk. This is especially advantageous for rotational disks because it avoids costly seeks. If the data is written that way, it can also be read that way, so it becomes an advantage for both reads and writes. In fact this optimization, in combination with data locality (i.e. doing the processing where the data is), is one of the main ideas of MapReduce.
  3. Network throughput: Data locality is the more important optimization, but in a distributed system it cannot always be achieved, so sometimes it's necessary to copy data between nodes. Normally one file (HDFS block) is transferred via one persistent TCP connection, which can reach a higher throughput when big files are transferred.
  4. Bigger default splits: Even though the splitsize can be configured at the job level, most people don't consider this and just go with the default, which is usually the blocksize. If your splitsize is too small, you can end up with too many mappers that don't have much work to do, which in turn can lead to even smaller output files, unnecessary overhead, and many occupied containers that can starve other jobs. It also has an adverse effect on the reduce phase, since the results must be fetched from all mappers.

    Of course the ideal splitsize heavily depends on the kind of work you have to do. But you can always set a lower splitsize when necessary, whereas if you set a splitsize higher than the blocksize you might lose some data locality.

    The latter aspect is less of an issue than one would think, though, because the rule for block placement in HDFS is: the first replica of each block is written on the datanode where the process creating the file runs, the second on another node in the same rack, and the third on a node in another rack. So usually one datanode holds a replica of every block of a file, and data locality can still be achieved even when one mapper reads several blocks because the splitsize is a multiple of the blocksize. Still, in this case the MapReduce framework can only select one node instead of the usual three to achieve data locality, so some effect can't be denied.

    But ultimately this argument for a bigger blocksize is probably the weakest of all, since one can set the splitsize independently if necessary.

But there also have to be arguments for a smaller blocksize, otherwise we should just set it to infinity…

  1. Parallelism/Distribution: If your input data lies on just a few nodes, even a big cluster doesn't help you achieve parallel processing, at least if you want to maintain some data locality. As a rule, I would say a good blocksize should match what you can also accept as a splitsize for your default workload.
  2. Fault tolerance and latency: If a network connection breaks, the cost of retransmitting a smaller block is lower. TCP throughput may be important, but individual connections shouldn't take forever either.

Weighing these factors against each other depends on your kind of data, cluster, workload, etc. But in general I think the default blocksize of 128 MB is already a little low for typical use cases; 512 MB or even 1 GB might be worth considering.

But before you even dig into that, you should first check the size of your input files. If most of your files are small and don't even reach the default blocksize, your blocksize is effectively the filesize, and increasing the default blocksize wouldn't help anything. There are workarounds, such as using a combining input format to avoid spawning too many mappers (see the sketch below), but ultimately you need to ensure your input files are big enough to take advantage of a big blocksize.
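
One common way to implement that combining workaround (my assumption; the answer doesn't name a specific class) is Hadoop's CombineTextInputFormat. The 256 MB cap and job name below are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    public class CombineSmallFilesDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Pack many small files into splits of at most 256 MB, so one mapper
            // processes a batch of files instead of one mapper per tiny file.
            conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
                         256L * 1024 * 1024);

            Job job = Job.getInstance(conf, "combine-small-files-demo");
            job.setInputFormatClass(CombineTextInputFormat.class);
        }
    }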

And if your files are already small, don't compound the problem by making the blocksize even smaller.

成全新的幸福
#4 · 2020-03-01 02:50

It depends on the input data. The number of mappers is directly proportional to the number of input splits, which depends on the DFS block size.

If you want to maximize throughput for a very large input file, using very large blocks (128 MB or even 256 MB) is best.

If a job has more than 1 TB of input, consider increasing the block size of the input dataset to 256 MB or even 512 MB so that the number of tasks is smaller.

For smaller files, using a smaller block size is better.

Have a look at this article

If you have small files that are smaller than the default DFS block size, you can use alternatives such as HAR files or SequenceFiles (a sketch of the SequenceFile approach follows).
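
As a rough illustration of the SequenceFile route (the input directory and output path are made-up placeholders), many small files can be packed into a single SequenceFile, with the file name as the key and the raw bytes as the value:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilesToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path output = new Path("/tmp/packed.seq");         // hypothetical output
            Path smallFilesDir = new Path("/tmp/small-files"); // hypothetical input dir

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(output),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                // One record per small file: key = file name, value = file contents.
                for (FileStatus status : fs.listStatus(smallFilesDir)) {
                    byte[] contents = new byte[(int) status.getLen()];
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        IOUtils.readFully(in, contents, 0, contents.length);
                    }
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(contents));
                }
            }
        }
    }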

Have a look at this cloudera blog
