Are multiple files stored in a single block?

Posted 2020-02-20 05:35

Question:

When I store many small files into HDFS, will they get stored in a single block?

Based on this discussion, my impression was that these small files would share a single block: HDFS block size Vs actual file size

Answer 1:

Quoting from Hadoop: The Definitive Guide:

HDFS stores small files inefficiently, since each file is stored in a block, and block metadata is held in memory by the namenode. Thus, a large number of small files can eat up a lot of memory on the namenode. (Note, however, that small files do not take up any more disk space than is required to store the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.) Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files.
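As a sketch of the HAR approach mentioned in the quote (all paths here are illustrative), packing a directory of small files into an archive and reading it back transparently looks like this:

# pack everything under /user/data/small into one HAR; this launches a MapReduce job
hadoop archive -archiveName small.har -p /user/data small /user/archive

# the archived files stay transparently readable through the har:// scheme
hdfs dfs -ls har:///user/archive/small.har/small

Note that building the archive runs a MapReduce job and the original files are not deleted automatically.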

Conclusion: Each file will get stored in a separate block.



Answer 2:

Below is what is specified in Hadoop: The Definitive Guide:

Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage

For example, if you have a 30 MB file and your block size is 64 MB, the file logically occupies one block, but on the physical file system HDFS uses only 30 MB to store it. The remaining 34 MB of the block's nominal size stays free for other data.
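You can verify the logical-versus-physical distinction yourself. Assuming a hypothetical 30 MB file at /user/data/sample.dat, stat reports the block size and file length, and du reports the bytes actually consumed:

# %o prints the block size, %b the file length in bytes (path is illustrative)
hdfs dfs -stat "block size: %o, length: %b" /user/data/sample.dat

# disk usage reflects the ~30 MB of data (times the replication factor), not 64 MB per replica
hdfs dfs -du -h /user/data/sample.dat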



Answer 3:

Each block belongs to only one file. You can verify this as follows:

1. Use the fsck command to get the block info for a file:

hadoop fsck /gavial/data/OB/AIR/PM25/201709/01/15_00.json -files -blocks

The output looks like this:

    /gavial/data/OB/AIR/PM25/201709/01/15_00.json 521340 bytes, 1 block(s):  OK
0. BP-1004679263-192.168.130.151-1485326068364:blk_1074920015_1179253 len=521340 repl=3

Status: HEALTHY
 Total size:    521340 B
 Total dirs:    0
 Total files:   1
 Total symlinks:        0
 Total blocks (validated):  1 (avg. block size 521340 B)
 Minimally replicated blocks:   1 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)

The block ID is

blk_1074920015

2. Use the fsck command to show the status of that block:

hdfs fsck -blockId blk_1074920015

Block Id: blk_1074920015
Block belongs to: /gavial/data/OB/AIR/PM25/201709/01/15_00.json
No. of Expected Replica: 3
No. of live Replica: 3
No. of excess Replica: 0
No. of stale Replica: 0
No. of decommission Replica: 0
No. of corrupted Replica: 0
Block replica on datanode/rack: datanode-5/default-rack is HEALTHY
Block replica on datanode/rack: datanode-1/default-rack is HEALTHY

Clearly, the block belongs to only one file.
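To see the same one-block-per-file behavior across a whole directory of small files (directory path illustrative), run fsck over the directory; every small file reports its own block:

hadoop fsck /user/data/small -files -blocks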



Answer 4:

No. When you store a large number of small files, each file is stored in its own block (or blocks); HDFS never packs data from two files into the same block. The inefficiency comes from metadata: for every small file the namenode keeps an in-memory entry (filename, permissions, block list), so many small files use up the memory reserved for metadata on the namenode far faster than a small number of very large files holding the same data.
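As a rough, commonly cited rule of thumb, each file, directory, and block object costs on the order of 150 bytes of namenode heap. Ten million 1 MB files therefore cost roughly 10,000,000 files x 2 objects (one file, one block) x 150 B, or about 3 GB of heap, whereas the same 10 TB stored as a handful of multi-gigabyte files needs only a few thousand objects. You can check how many file objects a directory contributes with (path illustrative):

hdfs dfs -count /user/data/small
# output columns: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME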



Tags: hadoop hdfs