If the data block size can be changed in Hadoop, please let me know how to do that. Is it advantageous to change the block size? If yes, let me know why and how; if not, let me know why.
You can change the block size at any time unless the dfs.blocksize parameter is defined as final in hdfs-site.xml.

To change the block size:
- while running a hadoop fs command you can run hadoop fs -Ddfs.blocksize=67108864 -put <local_file> <hdfs_path>. This command will save the file with a 64 MB block size.
- while running a hadoop jar command - hadoop jar <jar_file> <class> -Ddfs.blocksize=<desired_block_size> <other_args>. The reducer will use the defined block size while storing the output in HDFS.

Criteria for changing block size:
I have covered most of this in parts 2 and 3 of this performance tuning playlist.
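To make the first point concrete, this is roughly what the relevant entry in hdfs-site.xml looks like; the 128 MB value and the final flag shown here are illustrative, not something prescribed by the answer above:

    <!-- hdfs-site.xml: cluster-wide default block size (example value) -->
    <property>
      <name>dfs.blocksize</name>
      <value>134217728</value> <!-- 128 MB; size suffixes such as 128m are also accepted -->
      <!-- If this were set to true, clients could not override dfs.blocksize with -D -->
      <final>false</final>
    </property>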
There seems to be much confusion about this topic, and also wrong advice going around. To lift the confusion it helps to think about how HDFS is actually implemented:
HDFS is an abstraction over distributed disk-based file systems. So the words "block" and "blocksize" have a different meaning than generally understood. For HDFS, a "file" is just a collection of blocks, and each "block" in turn is stored as an actual file on a datanode. In fact the same block is stored on several datanodes, according to the replication factor. The blocksize of these individual files and their other performance characteristics in turn depend on the underlying filesystems of the individual datanodes.
The mapping between an HDFS file and the individual files on the datanodes is maintained by the namenode. But the namenode doesn't expect a specific blocksize; it just stores the mappings which were created during the creation of the HDFS file, which is usually split according to the default dfs.blocksize (but this can be overridden for individual files). This means, for example, that if you have a 1 MB file with a replication factor of 3 and a blocksize of 64 MB, you don't lose 63 MB * 3 = 189 MB, since physically just three 1 MB files are stored, with the standard blocksize of the underlying filesystems (e.g. ext4).
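Since that per-file override is easy to miss, here is a minimal sketch (untested, with a made-up path and example sizes) of how a client could create a single HDFS file with its own blocksize through the standard FileSystem API, while all other files keep the cluster default:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            long blockSize = 256L * 1024 * 1024;   // 256 MB, for this one file only (example value)
            short replication = 3;
            int bufferSize = conf.getInt("io.file.buffer.size", 4096);

            // The blocksize is fixed at creation time and recorded in the namenode's
            // metadata for this file; other files keep the cluster default.
            try (FSDataOutputStream out = fs.create(
                    new Path("/tmp/example.dat"), true, bufferSize, replication, blockSize)) {
                out.writeUTF("hello");
            }
        }
    }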
So the question becomes what a good dfs.blocksize is and whether it's advisable to change it. Let me first list the aspects speaking for a bigger blocksize:

Bigger default splits: even though the splitsize can be configured at job level, most people don't consider this and just go with the default, which is usually the blocksize. If your splitsize is too small, though, you can end up with too many mappers that don't have much work to do, which in turn can lead to even smaller output files, unnecessary overhead and many occupied containers which can starve other jobs. This also has an adverse effect on the reduce phase, since the results must be fetched from all mappers.
Of course the ideal splitsize heavily depends on the kind of work you have to do. But you can always set a lower splitsize when necessary, whereas if you set a splitsize higher than the blocksize you might lose some data locality.
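As a concrete illustration of tuning the splitsize at job level instead of touching dfs.blocksize, here is a minimal sketch using the standard FileInputFormat helpers; the 64 MB and 512 MB bounds are example values only:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeTuning {
        // Sets job-level split bounds; the block size of the input files is untouched.
        public static void configureSplits(Job job) {
            // Lower bound: avoid splits (and therefore mappers) smaller than 64 MB
            FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
            // Upper bound: cap splits at 512 MB even if the blocks are larger
            FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);
            // Equivalent configuration keys:
            //   mapreduce.input.fileinputformat.split.minsize
            //   mapreduce.input.fileinputformat.split.maxsize
        }
    }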
The latter aspect is less of an issue than one would think, though, because of the block placement rule in HDFS: the first replica is written to the datanode where the process creating the file runs, the second to a node in another rack, and the third to a different node in that same remote rack. So one replica of each block of a file can usually be found on a single datanode, and data locality can still be achieved even when one mapper reads several blocks because the splitsize is a multiple of the blocksize. Still, in this case the mapred framework can only select one node instead of the usual three to achieve data locality, so an effect can't be denied.
But ultimately this point for a bigger blocksize is probably the weakest of all, since one can set the splitsize independently if necessary.
But there also have to be arguments for a smaller blocksize, otherwise we should just set it to infinity…
Weighing these factors against each other depends on your kind of data, cluster, workload etc. But in general I think the default blocksize of 128 MB is already a little low for typical use cases. 512 MB or even 1 GB might be worth considering.
But before you even dig into that, you should first check the size of your input files. If most of your files are small and don't even reach the default blocksize, your blocksize is effectively always the filesize, and increasing the default blocksize wouldn't help anything. There are workarounds, like using an input combiner to avoid spawning too many mappers, but ultimately you need to ensure your input files are big enough to take advantage of a big blocksize.
And if your files are already small, don't compound the problem by making the blocksize even smaller.
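The "input combiner" workaround mentioned above usually means something like CombineTextInputFormat, which packs many small files into fewer splits so that one mapper processes several files. A minimal sketch, assuming the new MapReduce API and an illustrative 256 MB cap on the combined split size:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SmallFilesJobSetup {
        // Groups many small input files into fewer splits so one mapper reads several files.
        public static void configure(Job job) {
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Cap on the combined split size (mapreduce.input.fileinputformat.split.maxsize),
            // which the combine input format falls back to when grouping files; 256 MB is an example.
            FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        }
    }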
It depends on the input data. The number of mappers is directly proportional to the number of input splits, which depend on the DFS block size.
If you want to maximize throughput for a very large input file, using very large blocks (128 MB or even 256 MB) is best.
If a job has more than 1 TB of input, consider increasing the block size of the input dataset to 256 MB or even 512 MB so that the number of tasks will be smaller.
For smaller files, using a smaller block size is better.
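Because all of this depends on the actual sizes of your input files, it can help to check them first. A minimal sketch (the /data/input path is just a placeholder) that counts how many files in a directory are smaller than one block:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class InputSizeCheck {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path inputDir = new Path("/data/input");   // placeholder path
            long blockSize = fs.getDefaultBlockSize(inputDir);

            int small = 0, total = 0;
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isFile()) {
                    total++;
                    // A file smaller than one block cannot benefit from a larger dfs.blocksize
                    if (status.getLen() < blockSize) {
                        small++;
                    }
                }
            }
            System.out.printf("%d of %d files are smaller than one block (%d bytes)%n",
                    small, total, blockSize);
        }
    }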
Have a look at this article.
If you have small files that are smaller than the DFS block size, you can use alternatives like HAR or SequenceFiles.
Have a look at this Cloudera blog.