Avro vs. Parquet

Published 2019-01-21 04:17

Question:

I'm planning to use one of the Hadoop file formats for my Hadoop-related project. I understand that Parquet is efficient for column-based queries and Avro for full scans or when we need all of the column data.

Before I proceed and choose one of the file formats, I want to understand the disadvantages/drawbacks of one over the other. Can anyone explain them to me in simple terms?

Answer 1:

If you haven't already decided, I'd go ahead and write Avro schemas for your data. Once that's done, choosing between Avro container files and Parquet files is about as simple as swapping out e.g.,

job.setOutputFormatClass(AvroKeyOutputFormat.class);
AvroJob.setOutputKeySchema(MyAvroType.getClassSchema());

for

job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, MyAvroType.getClassSchema());
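
(For reference, here is a minimal sketch of the imports the two snippets above assume; the package names are taken from the avro-mapred and parquet-avro artifacts and may vary by version, so treat them as an assumption rather than exact coordinates.)

// Assumed imports for the Avro and Parquet output formats above
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.parquet.avro.AvroParquetOutputFormat;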

The Parquet format does seem to be a bit more computationally intensive on the write side, e.g., requiring RAM for buffering and CPU for ordering the data, but it should reduce I/O, storage, and transfer costs, as well as make for efficient reads, especially with SQL-like queries (e.g., Hive or Spark SQL) that only address a portion of the columns.
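
To make the read-side benefit concrete, here is a minimal sketch (not from the original answer) of a column-projected read using the parquet-avro API; the file name users.parquet and the User/name schema are hypothetical:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class ProjectionReadSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical projection schema: request only the "name" column.
    Schema projection = SchemaBuilder.record("User").fields()
        .requiredString("name")
        .endRecord();

    Configuration conf = new Configuration();
    AvroReadSupport.setRequestedProjection(conf, projection);

    // Only the pages of the projected column are read from disk.
    try (ParquetReader<GenericRecord> reader =
        AvroParquetReader.<GenericRecord>builder(new Path("users.parquet"))
            .withConf(conf)
            .build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        System.out.println(record.get("name"));
      }
    }
  }
}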

In one project, I ended up reverting from Parquet to Avro containers because the schema was too extensive and nested (being derived from some fairly hierarchical object-oriented classes) and resulted in thousands of Parquet columns. In turn, our row groups were really wide and shallow, which meant that it took forever before we could process a small number of rows in the last column of each group.

I haven't had much chance to use Parquet for more normalized/sane data yet, but I understand that, used well, it allows for significant performance improvements.



Answer 2:

Avro is a row-based format. If you want to retrieve the data as a whole, you can use Avro.

Parquet is a column-based format. If your data consists of a lot of columns but you are interested in only a subset of them, you can use Parquet.

HBase is useful when frequent updates to the data are involved. Avro is fast in retrieval; Parquet is much faster.



Answer 3:

Avro

  • Widely used as a serialization platform
  • Row-based, offers a compact and fast binary format
  • Schema is encoded in the file, so the data can be untagged
  • Files support block compression and are splittable
  • Supports schema evolution
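
A minimal sketch of the points above (schema embedded in the file, block compression), assuming a hypothetical User schema and output file users.avro:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroContainerSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema; in practice this is often loaded from an .avsc file.
    Schema schema = SchemaBuilder.record("User").fields()
        .requiredString("name")
        .requiredInt("age")
        .endRecord();

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Alice");
    user.put("age", 30);

    try (DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.setCodec(CodecFactory.deflateCodec(6)); // block compression
      writer.create(schema, new File("users.avro")); // schema goes in the header
      writer.append(user);
    }
  }
}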

Parquet

  • Column-oriented binary file format
  • Uses the record shredding and assembly algorithm described in the Dremel paper
  • Each data file contains the values for a set of rows
  • Efficient in terms of disk I/O when specific columns need to be queried

From Choosing an HDFS data storage format: Avro vs. Parquet and more



Answer 4:

Which format to use depends on the use case. Three factors can guide the choice:

  1. Read/write operations: Parquet is a column-based file format and supports indexing, which makes it suitable for read-intensive, complex, or analytical querying with low latency. It is generally used by end users/data scientists. Avro, being a row-based file format, is the best fit for write-intensive operations and is generally used by data engineers. Both support serialization and compression formats.

  2. Tools: Parquet is the best fit for Impala (which has an MPP engine), as Impala handles complex/interactive querying with low-latency output; it is supported by CDH. Similarly, HDP favors the ORC format (the selection also depends on the Hadoop distribution). Avro, on the other hand, is best suited to Spark processing.

  3. Schema evolution: changing the schema of the data over transformation and processing. Both Parquet and Avro support schema evolution, but to differing degrees. Comparatively, Avro provides much richer schema evolution: Parquet is good for append-style changes such as adding columns, while Avro is suitable for both adding and modifying fields. Here Avro shines compared to Parquet; a minimal sketch follows this list.
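
The sketch below shows Avro-style evolution, assuming a hypothetical users.avro container written earlier with only a name field; the email field and its default are illustrative:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

public class SchemaEvolutionSketch {
  public static void main(String[] args) throws Exception {
    // Evolved reader schema: adds an "email" field with a default value.
    Schema readerSchema = SchemaBuilder.record("User").fields()
        .requiredString("name")
        .name("email").type().stringType().stringDefault("unknown")
        .endRecord();

    // The writer schema comes from the file header; the reader schema drives
    // resolution, so records written before the change get the default.
    DatumReader<GenericRecord> datumReader =
        new GenericDatumReader<>(readerSchema);
    try (DataFileReader<GenericRecord> fileReader =
        new DataFileReader<>(new File("users.avro"), datumReader)) {
      while (fileReader.hasNext()) {
        GenericRecord user = fileReader.next();
        System.out.println(user.get("name") + " " + user.get("email"));
      }
    }
  }
}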



Answer 5:

Your understanding is right. In fact, we ran into a similar situation during data migration in our DWH. We chose Parquet over Avro as the disk savings we got were almost double what we got with Avro. The query processing time was also much better than with Avro. But yes, our queries were based on aggregation, column-based operations, etc., so Parquet was predictably a clear winner.

We are using Hive 0.12 from the CDH distro. You mentioned you are running into issues with Hive + Parquet; what are those? We did not encounter any.



Answer 6:

Silver Blaze described things nicely with an example use case and explained how Parquet was the best choice for him. It makes sense to consider one over the other depending on your requirements. I am also putting up a brief description of several other file formats, along with a time/space comparison. Hope that helps.

There are a bunch of file formats that you can use in Hive. Notable mentions are Avro, Parquet, RCFile, and ORC. There are some good documents available online that you may refer to if you want to compare the performance and space utilization of these file formats. Here are some useful links to get you going.

This Blog Post

This link from MapR [They don't discuss Parquet though]

This link from Inquidia

The links above should give you a good start. I hope this answers your query.

Thanks!



Answer 7:

For a description of Parquet, you can refer here: http://bigdata.devcodenote.com/2015/04/parquet-file-format.html

I intend to write very soon on Avro and a comparison between the two as well. I will post it here once done.