HDFS supports storing compressed files. I know that the gzip format doesn't support splitting. Imagine the file is gzip-compressed with a compressed size of 1 GB. Now my question is:
- How will this file be stored in HDFS (block size is 64 MB)?
From this link I learned that the gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed blocks.
But I couldn't understand it completely and am looking for a broad explanation.
More doubts about the gzip-compressed file:
- How many blocks will there be for this 1 GB gzip-compressed file?
- Will it be spread over multiple datanodes?
- How will the replication factor apply to this file (the cluster replication factor is 3)?
- What is the DEFLATE algorithm?
- Which algorithm is applied while reading the gzip-compressed file?
I am looking for a broad and detailed explanation here.
How will this file be stored in HDFS (block size is 64 MB) if splitting is not supported for the gzip format?
HDFS splits the file into fixed-size blocks regardless of the file format. "Not splittable" matters only at processing time: a single mapper must read the whole gzip stream from beginning to end, because DEFLATE blocks cannot be decompressed starting from an arbitrary offset. At storage time, the 1 GB file is cut into 16 DFS blocks (1 GB / 64 MB = 15.625, rounded up) and those blocks are distributed across datanodes like the blocks of any other file.
How many blocks will there be for this 1 GB gzip-compressed file?
1 GB / 64 MB = 15.625, rounded up to 16 DFS blocks (the last block is only partially full and occupies just its actual size on disk).
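The block arithmetic above can be sketched as follows; the 1000 MB figure matches the 15.625 in the answer (a minimal sketch, not HDFS code):

```python
import math

# Figures from the answer: a 1 GB file (taken as 1000 MB, matching the
# 15.625 above) and the question's 64 MB HDFS block size.
FILE_SIZE_MB = 1000
BLOCK_SIZE_MB = 64

# HDFS rounds up: a partially filled last block still counts as a block,
# but only occupies its actual size on disk.
num_blocks = math.ceil(FILE_SIZE_MB / BLOCK_SIZE_MB)
last_block_mb = FILE_SIZE_MB - (num_blocks - 1) * BLOCK_SIZE_MB

print(num_blocks)     # 16
print(last_block_mb)  # 40
```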
How will the replication factor apply to this file (the cluster replication factor is 3)?
Same as for any other file. Replication works at the block level, not the file level, so whether the file is splittable makes no difference: each of the 16 DFS blocks gets 3 replicas, giving 48 block replicas spread across the cluster according to the default block placement policy.
From the source code at this link: http://grepcode.com/file_/repo1.maven.org/maven2/com.ning/metrics.action/0.2.7/org/apache/hadoop/hdfs/server/namenode/ReplicationTargetChooser.java/?v=source
and
http://grepcode.com/file_/repo1.maven.org/maven2/org.apache.hadoop/hadoop-hdfs/0.22.0/org/apache/hadoop/hdfs/server/namenode/BlockPlacementPolicyDefault.java/?v=source
/** The class is responsible for choosing the desired number of targets
* for placing block replicas.
* The replica placement strategy is that if the writer is on a datanode,
* the 1st replica is placed on the local machine,
* otherwise a random datanode. The 2nd replica is placed on a datanode
* that is on a different rack. The 3rd replica is placed on a datanode
* which is on the same rack as the first replica.
*/
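The placement strategy in the comment above can be sketched as a toy routine. The rack and datanode names are made up for illustration; this is not the real `BlockPlacementPolicyDefault` logic, just a minimal sketch of its stated rules:

```python
import random

# Hypothetical two-rack cluster: rack name -> datanodes on that rack.
CLUSTER = {
    "rack-a": ["dn1", "dn2", "dn3"],
    "rack-b": ["dn4", "dn5", "dn6"],
}

def rack_of(node):
    """Return the rack a datanode belongs to."""
    return next(rack for rack, nodes in CLUSTER.items() if node in nodes)

def choose_targets(writer=None):
    """Pick 3 replica targets per the quoted comment: 1st replica on the
    writer's datanode (or a random one), 2nd on a different rack, 3rd on
    the same rack as the 1st."""
    all_nodes = [n for nodes in CLUSTER.values() for n in nodes]
    first = writer if writer in all_nodes else random.choice(all_nodes)
    first_rack = rack_of(first)
    # 2nd replica: any datanode on a different rack.
    second = random.choice([n for n in all_nodes if rack_of(n) != first_rack])
    # 3rd replica: a different datanode on the same rack as the 1st.
    third = random.choice([n for n in CLUSTER[first_rack] if n != first])
    return [first, second, third]
```

With this policy a rack failure can take out at most two of the three replicas, while two of the three writes stay within one rack to save cross-rack bandwidth.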
What is the DEFLATE algorithm?
DEFLATE is the lossless compression algorithm (LZ77 plus Huffman coding) that gzip uses to compress the data; reading a gzip file applies the inverse operation, called inflate, to decompress it.
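One quick way to see that gzip is just a thin wrapper around a DEFLATE stream (a minimal sketch using Python's standard `gzip` and `zlib` modules):

```python
import gzip
import zlib

data = b"HDFS stores gzip files as ordinary blocks" * 100

# A gzip member = 10-byte header + raw DEFLATE stream + 8-byte trailer
# (CRC32 and uncompressed size).
gz = gzip.compress(data)

# Strip the gzip framing and inflate the raw DEFLATE stream directly;
# wbits=-15 tells zlib to expect a headerless (raw) DEFLATE stream.
raw_deflate = gz[10:-8]
assert zlib.decompress(raw_deflate, -15) == data
```

This also illustrates why gzip is not splittable: decompression must start at the beginning of the DEFLATE stream, so one reader has to consume the whole file.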
Have a look at this slide for an overview of the algorithms used by the different zip variants.
Have a look at this presentation for more details.