Hadoop dfs replicate

Posted 2019-02-10 08:27

Question:

Sorry guys, just a simple question, but I cannot find the exact question on Google. What does dfs.replication mean? If I create a file named filmdata.txt in HDFS and set dfs.replication=1, is there exactly one copy of the file (one filmdata.txt), or does Hadoop create another replica in addition to the main file (filmdata.txt)? Put shortly: with dfs.replication=1, is there one filmdata.txt in total, or two? Thanks in advance.

Answer 1:

The total number of copies of a file stored in the file system is what the dfs.replication factor specifies. So, if you set dfs.replication=1, there will be only one copy of the file in the file system.
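As a quick sketch (the path /user/me/ is just a hypothetical example), you can set the replication factor per file at upload time and then read it back with -ls, where the second column of the listing is the replication count:

    # Upload with a per-file replication factor of 1.
    hdfs dfs -D dfs.replication=1 -put filmdata.txt /user/me/filmdata.txt

    # Verify: the second column of the output is the replication count.
    hdfs dfs -ls /user/me/filmdata.txt
    # sample output: -rw-r--r--   1 me supergroup  ...  /user/me/filmdata.txt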

Check the Apache Documentation for the other configuration parameters.



Answer 2:

To ensure high availability of data, Hadoop replicates the data.

When we store files in HDFS, the Hadoop framework splits each file into a set of blocks (64 MB or 128 MB), and these blocks are then replicated across the cluster nodes. The dfs.replication setting specifies how many replicas are required.

The default value for dfs.replication is 3, but this is configurable depending on your cluster setup.
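For instance (reusing the hypothetical path /user/me/filmdata.txt from above), you can check the default configured on your cluster and change an existing file's replication after the fact:

    # Print the cluster's configured default replication factor.
    hdfs getconf -confKey dfs.replication

    # Change an existing file's replication; -w waits until replication completes.
    hdfs dfs -setrep -w 2 /user/me/filmdata.txt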

Hope this helps.



Answer 3:

The link provided by Praveen is now broken. Here is the updated link describing the parameter dfs.replication.

Refer to the Hadoop Cluster Setup documentation for more information on configuration parameters.

You may want to note that files can span multiple blocks, and each block is replicated the number of times specified in dfs.replication (default value is 3). The size of these blocks is specified by the parameter dfs.block.size (dfs.blocksize in newer releases).
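To see this for yourself (again using a hypothetical path), fsck prints the individual blocks of a file and where each replica lives, and getconf prints the configured block size:

    # Show the file's blocks, their replica counts, and replica locations.
    hdfs fsck /user/me/filmdata.txt -files -blocks -locations

    # Print the configured block size in bytes (e.g. 134217728 = 128 MB).
    hdfs getconf -confKey dfs.blocksize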



Answer 4:

In HDFS we use commodity machines to store data. These are not high-end servers with large amounts of RAM, so there is a real chance of losing a data node (d1, d2, d3) or a block (b1, b2, b3). To guard against this, the HDFS framework keeps three replicas of each data block (64 MB or 128 MB) by default, each stored on a separate data node (d1, d2, d3). Now suppose block b1 becomes corrupted on data node d1: copies of b1 are still available on d2 and d3, so the client can ask d2 to process the b1 data and return the result, and likewise, if d2 fails, the client can ask d3. This is what dfs.replication means.
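You can watch this protection at work from the command line: fsck reports blocks that have lost replicas, and the NameNode re-replicates under-replicated blocks on its own (a sketch; run it as a user with read access to the paths being checked):

    # Full health report, including under-replicated and corrupt block counts.
    hdfs fsck /

    # List only the files that currently have corrupt blocks.
    hdfs fsck / -list-corruptfileblocks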

Hope you got some clarity.



Tags: hadoop hdfs