Does a block in the Hadoop Distributed File System store multiple small files, or does a block store only one file?
The main point to understand about HDFS is that a file is partitioned into blocks based on its size; it is not the case that there are pre-allocated blocks sitting in memory into which files are stored (that is a misconception). Multiple files are not stored in a single block (unless it is an archive, i.e. a HAR file).
Well, you could do that using the HAR (Hadoop Archive) filesystem, which packs multiple small files into the HDFS blocks of a special part file managed by the HAR layer.
A block will store a single file. If your file is bigger than the block size (64 MB/128 MB/...), it will be partitioned into multiple blocks, each up to the block size.
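The splitting rule above can be sketched in a few lines of Python (this is illustrative arithmetic, not Hadoop code; the 128 MB default block size is an assumption):

```python
# Sketch: how a file's size maps onto HDFS blocks. Every block but the
# last is full; the last block holds only the remainder and is NOT padded
# out to the full block size on disk.

def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    """Return the sizes (in bytes) of the blocks a file would occupy."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(remaining, block_size))
        remaining -= block_size
    return blocks

mb = 1024 * 1024
# A 300 MB file -> two full 128 MB blocks plus one 44 MB block.
print([b // mb for b in split_into_blocks(300 * mb)])  # [128, 128, 44]
# A 1 KB file -> a single 1 KB block.
print(split_into_blocks(1024))  # [1024]
```

Note that the small file still gets a block of its own; it simply shares it with no other file.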
Hadoop block size is a Hadoop storage concept. Every time you store a file in Hadoop, it is divided into blocks; based on the replication factor and data locality, those blocks are distributed over the cluster.
For Details:
When you push a file to HDFS, it is divided into blocks. Each block is like an individual file with a maximum size given by the block size.
Every block has a .meta file alongside it, storing the metadata of that block on Hadoop.
If the file is very small, the whole file goes into one block, and the block (a storage file on the DataNode) has the same size as the file, plus a meta file.
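The .meta file sizes in the listing below line up with how HDFS checksums block data. As a sketch (assuming the HDFS defaults of one 4-byte CRC32 checksum per 512-byte chunk and a small fixed header; the helper name is mine, not a Hadoop API):

```python
import math

# Rough estimate of a block's .meta file size: checksum data plus a
# small header. Assumes HDFS defaults: dfs.bytes-per-checksum = 512,
# 4-byte CRC per chunk, 7-byte metadata header.

def estimated_meta_size(block_bytes, bytes_per_checksum=512, header=7):
    """Approximate size in bytes of the .meta file for a block."""
    n_chunks = math.ceil(block_bytes / bytes_per_checksum)
    return header + 4 * n_chunks

# For a full 1.0 GB block this predicts roughly 8 MB of checksum data,
# which matches the 8.1M .meta file shown in the directory listing.
print(round(estimated_meta_size(1024**3) / 1024**2, 1))  # ~8.0
```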
Some commands (directories are as per my cluster, /data2/dfs/dn/):
Block size: 1 GB
cd /data2/dfs/dn -> current -> finalized -> subdir0 -> (here is the gold)
A block uses only KBs of storage for a small file, or when the file size is the block size plus some KBs (the last block holds just the remainder):
-rw-r--r-- 1 hdfs hdfs 91K Sep 13 16:19 blk_1073781504
-rw-r--r-- 1 hdfs hdfs 19K Sep 13 16:21 blk_1073781504_40923.meta
When the file is bigger than the block size, the block files look something like this:
-rw-r--r-- 1 hdfs hdfs 1.0G Aug 31 12:03 blk_1073753814
-rw-r--r-- 1 hdfs hdfs 8.1M Aug 31 12:04 blk_1073753814_12994.meta
I hope this explains the block storage. If you want to see in detail how your file is stored in blocks, run:
hdfs fsck /path/to/file -files -blocks -locations
Let me know if I missed anything here.
Multiple files are not stored in a single block. However, a single file can be stored in multiple blocks. The mapping between a file and its block IDs is persisted in the NameNode.
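The file-to-blocks mapping can be pictured as a simple dictionary (this is a sketch of the concept, not the real NameNode data structures; all names here are illustrative):

```python
# Sketch: the NameNode keeps a namespace mapping each file path to the
# ordered list of block IDs making up that file. Block IDs below mimic
# the blk_NNN naming seen in the DataNode directory listings.

BLOCK_SIZE = 128 * 1024 * 1024  # assumed 128 MB block size

namespace = {}                  # path -> list of block IDs
next_block_id = 1073741825      # arbitrary starting ID for illustration

def create_file(path, size):
    """Register a file: one block per BLOCK_SIZE chunk of data."""
    global next_block_id
    n_blocks = -(-size // BLOCK_SIZE)  # ceiling division
    blocks = [f"blk_{next_block_id + i}" for i in range(n_blocks)]
    next_block_id += n_blocks
    namespace[path] = blocks
    return blocks

mb = 1024 * 1024
print(len(create_file("/data/big.log", 300 * mb)))  # 3 blocks
print(len(create_file("/data/small.txt", 1024)))    # 1 block
```

Note each file gets its own blocks; no block ID ever appears under two paths, which is exactly why small files never share a block.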
According to Hadoop: The Definitive Guide:
HDFS is designed to handle large files. If there are too many small files, the NameNode can become overloaded, since it keeps the namespace for HDFS in memory. Check this article on how to alleviate the problem of too many small files.
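A back-of-the-envelope calculation shows why small files load the NameNode. A common rule of thumb (cited in Hadoop: The Definitive Guide) is that each file, directory, and block costs on the order of 150 bytes of NameNode heap; the exact figure varies by version, so treat this as an estimate:

```python
# Sketch: approximate NameNode heap consumed by a set of files.
# Rule of thumb: ~150 bytes per namespace object (file or block).

OBJECT_COST = 150  # bytes per file/block object, rough estimate

def namenode_bytes(n_files, blocks_per_file=1):
    """Approximate NameNode heap for n_files files."""
    # one file object plus one object per block
    return n_files * (1 + blocks_per_file) * OBJECT_COST

gb = 1024**3
# 10 million 1 KB files (one block each): ~3 GB of NameNode heap
# for only ~10 GB of actual data.
print(round(namenode_bytes(10_000_000) / gb, 2))
# The same 10 GB as a single file of eighty 128 MB blocks: ~12 KB of heap.
print(namenode_bytes(1, blocks_per_file=80))
```

This is why packing small files into HAR archives or sequence files helps: it shrinks the number of namespace objects, not the amount of data.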