Where Does the HDFS Account for Triple Replication

Posted 2019-04-11 00:42

In the latest version of most Hadoop distributions, the HDFS usage reports seem to report on space without accounting for the replication factor, correct?

When one looks at the Namenode Web UI and/or runs the 'hadoop dfsadmin -report' command, one can see a report that looks something like this:

Configured Capacity: 247699161084 (230.69 GB)
Present Capacity: 233972113408 (217.9 GB)
DFS Remaining: 162082414592 (150.95 GB)
DFS Used: 71889698816 (66.95 GB)
DFS Used%: 30.73%
Under replicated blocks: 40
Blocks with corrupt replicas: 6
Missing blocks: 0

Based on the machine sizes of this cluster, it seems that this report does NOT account for triple replication, i.e. if I place a file on HDFS, I should account for the triple replication myself.

For example, if I placed a 50GB file on the HDFS, would my HDFS be dangerously close to full (since it seems that file would be replicated 3 times, using up the 150GB that currently remain)?

Tags: hadoop size hdfs
2 answers
做个烂人
#2 · 2019-04-11 00:59

dfsadmin report does consider replication. If you want the pre-replication used bytes, use:

hdfs dfs -du -s /
啃猪蹄的小仙女
#3 · 2019-04-11 01:19

Let us define clearly what each of these terms means.

  1. Configured Capacity: This is the total capacity available to HDFS for storage. So if you have 4 nodes and each node has 50 GB capacity, the configured capacity will be 200 GB. The replication factor is irrelevant to configured capacity.

  2. DFS Used: This is the amount of storage space that has been used up by HDFS. Divide DFS Used by your replication factor to get the actual size of your files stored without replication. So if your DFS used is 60 GB, and your replication factor is 3, the actual size of your files is 60/3 = 20 GB.

  3. DFS Remaining: This is the amount of storage space still available to HDFS. If you have 150 GB of remaining storage space, that means you can store up to 150/3 = 50 GB of files without exceeding your Configured Capacity (assuming replication factor = 3).

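  4. Present Capacity: The amount of storage space HDFS can actually use for block data. It equals DFS Used + DFS Remaining (in the report above, 71889698816 + 162082414592 = 233972113408). The difference (Configured Capacity - Present Capacity) is space on the DataNode disks that HDFS cannot use for blocks because other, non-DFS data occupies it, rather than space set aside for file system metadata and inode information.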
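The arithmetic in the list above can be checked against the report from the question. This is only a rough sketch: the byte counts are copied from that report, the replication factor of 3 is assumed (the question implies it but the report does not state it), and the helper names `logical_bytes` and `fits` are made up for illustration.

```python
# Figures copied from the dfsadmin report in the question (bytes).
PRESENT_CAPACITY = 233972113408
DFS_REMAINING = 162082414592
DFS_USED = 71889698816

REPLICATION = 3  # assumption: the cluster uses a replication factor of 3
GIB = 1024 ** 3  # the report's "GB" figures are powers of 1024


def logical_bytes(raw_bytes, replication=REPLICATION):
    """Convert raw (replicated) bytes into pre-replication bytes."""
    return raw_bytes / replication


def fits(file_bytes, remaining=DFS_REMAINING, replication=REPLICATION):
    """True if every replica of the file fits into the remaining raw space."""
    return file_bytes * replication <= remaining


print(f"files stored (pre-replication): {logical_bytes(DFS_USED) / GIB:.2f} GiB")
print(f"room left (pre-replication):    {logical_bytes(DFS_REMAINING) / GIB:.2f} GiB")
print(f"DFS Used%: {100 * DFS_USED / PRESENT_CAPACITY:.2f}%")
print("50 GiB file fits:", fits(50 * GIB))
```

Under these assumptions the 50 GB file from the question does fit, but only barely: its three replicas would consume almost all of DFS Remaining, so the cluster would indeed be very close to full.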

Hope this clears it up.
