How does Hive 'alter table concatenate' work?

Published 2020-07-22 19:42

Question:

I have a large number (n) of small ORC files which I want to merge into a small number (k) of large ORC files.

This can be done using the alter table table_name concatenate command in Hive.
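For reference, the Hive DDL supports concatenating an entire unpartitioned table or a single partition (table name and partition spec below are placeholders):

```sql
-- Merge the small files of an unpartitioned table
ALTER TABLE table_name CONCATENATE;

-- Merge the small files of one partition of a partitioned table
ALTER TABLE table_name PARTITION (dt = '2020-07-22') CONCATENATE;
```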

I want to understand how Hive implements this. I'm looking to implement it using Spark, with any changes if required.

Any pointers would be great.

Answer 1:

As per the Hive documentation on Alter Table/Partition Concatenate:

If the table or partition contains many small RCFiles or ORC files, then the above command will merge them into larger files. In case of RCFile the merge happens at block level whereas for ORC files the merge happens at stripe level thereby avoiding the overhead of decompressing and decoding the data.

Also ORC Stripes:

The body of ORC files consists of a series of stripes. Stripes are large (typically ~200MB) and independent of each other and are often processed by different tasks. The defining characteristic for columnar storage formats is that the data for each column is stored separately and that reading data out of the file should be proportional to the number of columns read. In ORC files, each column is stored in several streams that are stored next to each other in the file. For example, an integer column is represented as two streams: PRESENT, which uses one bit per value recording whether the value is non-null, and DATA, which records the non-null values. If all of a column's values in a stripe are non-null, the PRESENT stream is omitted from the stripe. For binary data, ORC uses three streams: PRESENT, DATA, and LENGTH, which stores the length of each value. The details of each type will be presented in the following subsections.

For implementing this in Spark you can use Spark SQL through the Hive context:

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

scala> sqlContext.sql("Your_hive_query_here")
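Note that HiveContext is the Spark 1.x entry point; from Spark 2.x onward the same thing is done through a SparkSession with Hive support. If instead you want the n-files-to-k-files effect directly in Spark, a common approach is to read the ORC data and rewrite it with fewer output partitions. A minimal sketch, assuming placeholder paths and a chosen target file count k (unlike Hive's stripe-level concatenation, this decompresses and re-encodes the data):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-compaction")
  .enableHiveSupport()
  .getOrCreate()

// Read the n small ORC files (path is a placeholder)
val df = spark.read.orc("/warehouse/my_table")

// Target number of output files (assumption; tune to your data size)
val k = 8

// coalesce(k) narrows the data to k partitions without a full shuffle,
// so the rewrite produces k larger ORC files
df.coalesce(k)
  .write
  .mode("overwrite")
  .orc("/warehouse/my_table_compacted")
```

coalesce is preferred over repartition here because it avoids a shuffle when only reducing the partition count; use repartition(k) instead if the small files are badly skewed and you want evenly sized outputs.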