Storage format in HDFS

2019-03-30 13:03发布

How Does HDFS store data?

I want to store huge files in a compressed fashion.

E.g : I have a 1.5 GB of file, with default replication factor of 3.

It requires (1.5)*3 = 4.5 GB of space.

I believe currently no implicit compression of data takes place.

Is there a technique to compress the file and store it in HDFS to save disk space ?

4条回答
成全新的幸福
2楼-- · 2019-03-30 13:17

The answer to this question is to first understand the file format available in Hadoop today. There is now choice available within HDFS that can manage file format and compression techniques. Alternative to explicit encoding and splitting using LZO or BZIP. There is many format that today support block compression and columnar row compression with features.

A storage format is a way you define how information is to be stored. This is sometimes usually indicated by the extension of the file. For example we know images can be several storage formats, PNG, JPG, and GIF etc. All these formats can store the same image, but each has specific storage characteristics.

In Hadoop filesystem you have all of traditional storage formats available to you (like you can store PNG and JPG images on HDFS if you like), but you also have some Hadoop-focused file formats to use for structured and unstructured data.

Why is it important to know these formats

In any performance tradeoffs, a huge bottleneck for HDFS-enabled applications like MapReduce, Hive, HBase, and Spark is the time it takes to find relevant data in a particular location and the time it takes to write the data back to another location. These issues are accentuated when you manage large datasets. The Hadoop file formats have evolved to ease these issues across a number of use cases.

Choosing an appropriate file format can have some significant benefits:

  1. Optimum read time
  2. Optimum write time
  3. Spliting or partitioning of files (so you don’t need to read the whole file, just a part of it)
  4. Schema adaption (allowing a field changes to a dataset) Compression support (without sacrificing these features)

Some file formats are designed for general use, others are designed for more specific use cases (like powering a database), and some are designed with specific data characteristics in mind. So there really is quite a lot of choice when storing data in Hadoop and one should know to optimally store data in HDFS. Currently my go to storage is ORC format.

Check if your Big data components (Spark, Hive, HBase etc) support these format and make the decision accordingly. For example, I am currently injecting data into Hive and converting it into ORC format which works for me in terms of compression and performance.

Some common storage formats for Hadoop include:

Plain text storage (eg, CSV, TSV files, Delimited file etc)

Data is laid out in lines, with each line being a record. Lines are terminated by a newline character \n in the typical UNIX world. Text-files are inherently splittable. but if you want to compress them you’ll have to use a file-level compression codec that support splitting, such as BZIP2. This is not efficient and will require a bit of work when performing MapReduce tasks.

Sequence Files

Originally designed for MapReduce therefore very easy to integrate with Hadoop MapReduce processes. They encode a key and a value for each record and nothing more. Stored in a binary format that is smaller than a text-based format. Even here it doesn't encode the key and value in anyway. One benefit of sequence files is that they support block-level compression, so you can compress the contents of the file while also maintaining the ability to split the file into segments for multiple map tasks. Though still not efficient as per statistics like Parquet and ORC.

Avro

The format encodes the schema of its contents directly in the file which allows you to store complex objects natively. Its file format with additional framework for, serialization and deserialization framework. With regular old sequence files you can store complex objects but you have to manage the process. It also supports block-level compression.

Parquet

My favorite and hot format these days. Its a columnar file storage structure while it encodes and writes to the disk. So datasets are partitioned both horizontally and vertically. One huge benefit of columnar oriented file formats is that data in the same column tends to be compressed together which can yield some massive storage optimizations (as data in the same column tends to be similar). Try using this if your processing can optimally use column storage. You can refer to advantages of columnar storages.

If you’re chopping and cutting up datasets regularly then these formats can be very beneficial to the speed of your application, but frankly if you have an application that usually needs entire rows of data then the columnar formats may actually be a detriment to performance due to the increased network activity required.

ORC

ORC stands for Optimized Row Columnar which means it can store data in an optimized way than the other file formats. ORC reduces the size of the original data up to 75%(eg: 100GB file will become 25GB). As a result the speed of data processing also increases. ORC shows better performance than Text, Sequence and RC file formats. An ORC file contains rows data in groups called as Stripes along with a file footer. ORC format improves the performance when Hive is processing the data.

It is similar to the Parquet but with different encoding technique. Its not for this thread but you can lookup on Google for differences.

查看更多
Luminary・发光体
3楼-- · 2019-03-30 13:33

Some time ago I tried to summarize that in a blog post here. Essentially that is a question of data splittability, as a file is devided into blocks which are elementary blocks for replication. Name node is responsible for keeping track of all those blocks belonging to one file. It is essential that block is autonomous when choosing compression - not all codecs are splittable. If the format + codec is not splittable that means that in order to decompress it it needs to be in one place which has big impact on parallelism in mapreduce. Essentially running in single slot. Hope that helps.

查看更多
SAY GOODBYE
4楼-- · 2019-03-30 13:36

Have a look at presentation @ Hadoop_Summit, especially Slide 6 and Slide 7.

enter image description here

enter image description here

  1. If DFS block size is 128 MB, for 4.5 GB storage (including replication factor of 3), you need 35.15 ( ~36 blocks)
  2. Only bzip2 file format is splittable. In other formats, all blocks of entire files are stored in same Datanode
  3. Have a look at algorithm types and class names and codecs
  4. @Chris White answer provides information on how to enable zipping while writing Map output
查看更多
闹够了就滚
5楼-- · 2019-03-30 13:39

HDFS stores any file in a number of 'blocks'. The block size is configurable on a per file basis, but has a default value (like 64/128/256 MB)

So given a file of 1.5 GB, and block size of 128 MB, hadoop would break up the file into ~12 blocks (12 x 128 MB ~= 1.5GB). Each block is also replicated a configurable number of times.

If your data compresses well (like text files) then you can compress the files and store the compressed files in HDFS - the same applies as above, so if the 1.5GB file compresses to 500MB, then this would be stored as 4 blocks.

However, one thing to consider when using compression is whether the compression method supports splitting the file - that is can you randomly seek to a position in the file and recover the compressed stream (GZIp for example does not support splitting, BZip2 does).

Even if the method doesn't support splitting, hadoop will still store the file in a number of blocks, but you'll lose some benefit of 'data locality' as the blocks will most probably be spread around your cluster.

In your map reduce code, Hadoop has a number of compression codecs installed by default, and will automatically recognize certain file extensions (.gz for GZip files for example), abstracting you away from worrying about whether the input / output needs to be compressed.

Hope this makes sense

EDIT Some additional info in response to comments:

When writing to HDFS as output from a Map Reduce job, see the API for FileOutputFormat, in particular the following methods:

  • setCompressOutput(Job, boolean)
  • setOutputCompressorClass(Job, Class)

When uploading files to HDFS, yes they should be pre-compressed, and with the associated file extension for that compression type (out of the box, hadoop supports gzip with the .gz extension, so file.txt.gz would denote a gzipped file)

查看更多
登录 后发表回答