I have a large number (n) of small ORC files that I want to merge into a small number (k) of larger ORC files.
In Hive this is done with the ALTER TABLE table_name CONCATENATE command.
I want to understand how Hive implements this, because I'm looking to implement the same thing in Spark, with changes where required.
Any pointers would be great.
As per the Hive documentation on Alter Table/Partition Concatenate:
If the table or partition contains many small RCFiles or ORC files, then the above command will merge them into larger files. In case of RCFile the merge happens at block level whereas for ORC files the merge happens at stripe level thereby avoiding the overhead of decompressing and decoding the data.
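In practice the merge is a single statement per table or partition. A minimal sketch of issuing it through the HiveContext set up at the end of this answer (the table and partition names are hypothetical; whether a given Spark version passes this Hive-native command through to Hive varies, so running it from the Hive CLI is the safe path):

scala> sqlContext.sql("ALTER TABLE my_table PARTITION (dt='2016-01-01') CONCATENATE")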
Also, from the ORC specification's section on stripes:
The body of ORC files consists of a series of stripes. Stripes are
large (typically ~200MB) and independent of each other and are often
processed by different tasks. The defining characteristic for columnar
storage formats is that the data for each column is stored separately
and that reading data out of the file should be proportional to the
number of columns read.
In ORC files, each column is stored in several streams that are stored
next to each other in the file. For example, an integer column is
represented as two streams: PRESENT, which uses one bit per value to
record whether the value is non-null, and DATA, which records the
non-null values. If all of a column's values in a stripe are non-null,
the PRESENT stream is omitted from the stripe. For binary data, ORC
uses three streams PRESENT, DATA, and LENGTH, which stores the length
of each value.
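Because each stripe is self-contained, a stripe-level merge can append whole stripes from the input files into one output file without decompressing or decoding them. To see what Hive is working with, you can list the stripes of a file with the ORC reader API. A minimal sketch, assuming orc-core (the org.apache.orc package) is on the classpath; the input path is hypothetical:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile
import scala.collection.JavaConverters._

val reader = OrcFile.createReader(
  new Path("/data/my_table/small_file.orc"),  // hypothetical input file
  OrcFile.readerOptions(new Configuration()))

// Each StripeInformation describes one independently readable chunk.
reader.getStripes.asScala.foreach { s =>
  println(s"offset=${s.getOffset} length=${s.getLength} rows=${s.getNumberOfRows}")
}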
To implement this in Spark, you can use Spark SQL with a HiveContext created from the SparkContext:
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> sqlContext.sql("Your_hive_query_here")
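If you want to stay entirely in Spark instead of delegating to Hive, the usual alternative is to read the small files into a DataFrame, shrink the number of partitions to k, and write the result back out. Note this is not a stripe-level copy: Spark decompresses, decodes, and re-encodes the rows. A minimal sketch, assuming Spark 1.5+ built with Hive support; the paths are hypothetical:

scala> val k = 10  // desired number of output files
scala> val df = sqlContext.read.orc("/data/my_table/small_files")
scala> df.coalesce(k).write.orc("/data/my_table/merged")

coalesce(k) avoids a full shuffle, so it is usually cheaper than repartition(k) when the goal is just to reduce the file count.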