Small files and HDFS blocks

Posted 2019-01-14 13:51

Does a block in Hadoop Distributed File System store multiple small files, or a block stores only 1 file?

Tags: hadoop hdfs
5 answers
别忘想泡老子
Answer 2 · 2019-01-14 14:14

The main point to understand about HDFS is that a file is partitioned into blocks based on its size. It is a misconception that there is a pool of pre-existing blocks into which files are placed.

Basically, multiple files are not stored in a single block (unless you use an archive format such as HAR, a Hadoop Archive).

混吃等死
Answer 3 · 2019-01-14 14:15

Well, you could do that using the HAR (Hadoop Archive) filesystem, which packs multiple small files into the HDFS blocks of a special part file managed by the HAR layer. A HAR is created with the `hadoop archive` command.

我命由我不由天
Answer 4 · 2019-01-14 14:15

A block will store data from a single file. If your file is bigger than the block size (64 MB/128 MB/...), it will be split into multiple blocks of that size; only the last block may be partially filled.
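The arithmetic behind this can be sketched in plain Python (this is an illustration of the splitting rule, not HDFS code; `split_into_blocks` is a made-up helper name):

```python
def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    """Return the list of block sizes HDFS would use for a file.

    Every block except possibly the last is exactly block_size bytes;
    the last block holds only the remainder.
    """
    if file_size == 0:
        return []
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

MB = 1024 * 1024
# A 300 MB file with a 128 MB block size -> blocks of 128, 128 and 44 MB.
print([b // MB for b in split_into_blocks(300 * MB, 128 * MB)])  # [128, 128, 44]
# A 1 KB file still gets its own block, holding just 1024 bytes.
print(split_into_blocks(1024, 128 * MB))  # [1024]
```

Note that the small file's block entry is 1024 bytes, not 128 MB: the block is an upper bound, not a fixed allocation.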

劫难
Answer 5 · 2019-01-14 14:20

The Hadoop block size is a storage concept. Whenever you store a file in Hadoop, it is divided into blocks, and based on the replication factor and data locality those blocks are distributed over the cluster.

For Details:

  • When you push a file to HDFS, it is divided into blocks. Each block is like an individual file whose maximum size is the configured block size.

  • Every block has an accompanying .meta file that stores the block's metadata on the DataNode.

  • If the file is very small, the whole file fits in one block, and the block (a file on the DataNode's local disk) has the same size as the file, plus a small meta file.

Some Commands:

  • Connect to any DataNode in your cluster (if you have access). Then go to that node's storage directories, and you can see the actual block files stored on the DataNode, as below.

(Directories are as per my cluster - /data2/dfs/dn/):

Block size: 1 GB

cd /data/dfs/dn -> current -> finalized -> subdir0 -> (here is the gold)

A block uses only a few KB of storage for a small file, or for the final block when the file size is, say, my block size plus a few KB:

-rw-r--r-- 1 hdfs hdfs 91K Sep 13 16:19 blk_1073781504

-rw-r--r-- 1 hdfs hdfs 19K Sep 13 16:21 blk_1073781504_40923.meta

When the file is bigger than the block size, the block files look something like this:

-rw-r--r-- 1 hdfs hdfs 1.0G Aug 31 12:03 blk_1073753814

-rw-r--r-- 1 hdfs hdfs 8.1M Aug 31 12:04 blk_1073753814_12994.meta

I hope this explains the block storage. If you want to see in detail how your files are stored in blocks, run

hdfs fsck <path> -files -blocks -locations

Let me know if I missed anything here.

何必那么认真
Answer 6 · 2019-01-14 14:38

Multiple files are not stored in a single block. However, a single file can be stored in multiple blocks. The mapping between a file and its block IDs is persisted in the NameNode.

According to Hadoop: The Definitive Guide:

Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage.

HDFS is designed to handle large files. If there are too many small files, the NameNode can become overloaded, since it keeps the entire HDFS namespace in memory. Check this article on how to alleviate the problem of too many small files.
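To make the NameNode-load point concrete, here is a rough back-of-the-envelope calculation in plain Python. The commonly cited figure of roughly 150 bytes of NameNode heap per namespace object (file or block) is an approximation, and `namenode_heap_estimate` is a made-up helper name:

```python
BYTES_PER_OBJECT = 150  # rough rule of thumb per file or block object in NameNode heap

def namenode_heap_estimate(num_files, blocks_per_file=1):
    """Approximate NameNode memory for a namespace, in bytes.

    Each file contributes one file object plus blocks_per_file block objects.
    """
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# 10 GB stored as ten million 1 KB files (one block each)...
small_files = namenode_heap_estimate(10_000_000)          # ~3 GB of NameNode heap
# ...versus the same 10 GB as one file in 80 blocks of 128 MB.
one_big_file = namenode_heap_estimate(1, blocks_per_file=80)  # ~12 KB of heap
print(small_files, one_big_file)
```

The data volume on the DataNodes is identical in both cases; only the NameNode's in-memory namespace explodes with small files, which is why packing them into HARs or larger files helps.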
