Small files and HDFS blocks

Posted 2019-01-14 13:51

Does a block in Hadoop Distributed File System store multiple small files, or a block stores only 1 file?

Tags: hadoop hdfs
5 answers
别忘想泡老子
Answer 2 · 2019-01-14 14:14

The main point to understand about HDFS is that a file is partitioned into blocks based on its size. It is a misconception that there is a pool of pre-existing blocks into which files are placed.

Basically, multiple files are not stored in a single block (unless you use an archive format such as HAR, a Hadoop Archive).

混吃等死
Answer 3 · 2019-01-14 14:15

Well, you could do that using the HAR (Hadoop Archive) filesystem, which packs multiple small files into the HDFS blocks of a special part file managed by the HAR layer. A HAR is created with the `hadoop archive` command.

我命由我不由天
Answer 4 · 2019-01-14 14:15

A block will store data from a single file. If your file is bigger than the block size (64 MB/128 MB/...), it will be split into multiple blocks of that size; only the last block may be partially filled.
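The arithmetic behind this can be sketched in plain Python (this is an illustration of the splitting rule, not HDFS code; `split_into_blocks` is a made-up helper name):

```python
def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    """Return the list of block sizes HDFS would use for a file.

    Every block except possibly the last is exactly block_size bytes;
    the last block holds only the remainder.
    """
    if file_size == 0:
        return []
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

MB = 1024 * 1024
# A 300 MB file with a 128 MB block size -> blocks of 128, 128 and 44 MB.
print([b // MB for b in split_into_blocks(300 * MB, 128 * MB)])  # [128, 128, 44]
# A 1 KB file still gets its own block, holding just 1024 bytes.
print(split_into_blocks(1024, 128 * MB))  # [1024]
```

Note that the small file's block entry is 1024 bytes, not 128 MB: the block is an upper bound, not a fixed allocation.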

劫难
Answer 5 · 2019-01-14 14:20

The Hadoop block size is a storage concept. Whenever you store a file in Hadoop, it is divided into blocks, and based on the replication factor and data locality those blocks are distributed over the cluster.

For Details:

  • When you push a file to HDFS, it is divided into blocks. Each block is like an individual file whose maximum size is the configured block size.

  • Every block has an accompanying .meta file that stores the block's metadata on the DataNode.

  • If the file is very small, the whole file fits in one block, and the block (a file on the DataNode's local disk) has the same size as the file, plus a small meta file.

Some Commands:

  • Connect to any DataNode in your cluster (if you have access). Then go to that node's storage directories, and you can see the actual block files stored on the DataNode, as below.

(Directories are as per my cluster - /data2/dfs/dn/):

Block size: 1 GB

cd /data/dfs/dn -> current -> finalized -> subdir0 -> (here is the gold)

A block uses only a few KB of storage for a small file, or for the final block when the file size is, say, my block size plus a few KB:

-rw-r--r-- 1 hdfs hdfs 91K Sep 13 16:19 blk_1073781504

-rw-r--r-- 1 hdfs hdfs 19K Sep 13 16:21 blk_1073781504_40923.meta

When the file is bigger than the block size, the block files look something like this:

-rw-r--r-- 1 hdfs hdfs 1.0G Aug 31 12:03 blk_1073753814

-rw-r--r-- 1 hdfs hdfs 8.1M Aug 31 12:04 blk_1073753814_12994.meta

I hope this explains the block storage. If you want to see in detail how your files are stored in blocks, run

hdfs fsck <path> -files -blocks -locations

Let me know if I missed anything here.

何必那么认真
Answer 6 · 2019-01-14 14:38

Multiple files are not stored in a single block. However, a single file can be stored in multiple blocks. The mapping between a file and its block IDs is persisted in the NameNode.

According to Hadoop: The Definitive Guide:

Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage.

HDFS is designed to handle large files. If there are too many small files, the NameNode can become overloaded, since it keeps the entire HDFS namespace in memory. Check this article on how to alleviate the problem of too many small files.
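To make the NameNode-load point concrete, here is a rough back-of-the-envelope calculation in plain Python. The commonly cited figure of roughly 150 bytes of NameNode heap per namespace object (file or block) is an approximation, and `namenode_heap_estimate` is a made-up helper name:

```python
BYTES_PER_OBJECT = 150  # rough rule of thumb per file or block object in NameNode heap

def namenode_heap_estimate(num_files, blocks_per_file=1):
    """Approximate NameNode memory for a namespace, in bytes.

    Each file contributes one file object plus blocks_per_file block objects.
    """
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# 10 GB stored as ten million 1 KB files (one block each)...
small_files = namenode_heap_estimate(10_000_000)          # ~3 GB of NameNode heap
# ...versus the same 10 GB as one file in 80 blocks of 128 MB.
one_big_file = namenode_heap_estimate(1, blocks_per_file=80)  # ~12 KB of heap
print(small_files, one_big_file)
```

The data volume on the DataNodes is identical in both cases; only the NameNode's in-memory namespace explodes with small files, which is why packing them into HARs or larger files helps.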
