On Hive 2.2.0, I am filling an orc table from another source table of size 1.34 GB, using the query
INSERT INTO TABLE TableOrc SELECT * FROM Table; ---- (1)
The query creates TableORC table with 6 orc files, which are much smaller than the block size of 256MB.
-- FolderList1
-rwxr-xr-x user1 supergroup 65.01 MB 1/1/2016, 10:14:21 AM 1 256 MB 000000_0
-rwxr-xr-x user1 supergroup 67.48 MB 1/1/2016, 10:14:55 AM 1 256 MB 000001_0
-rwxr-xr-x user1 supergroup 66.3 MB 1/1/2016, 10:15:18 AM 1 256 MB 000002_0
-rwxr-xr-x user1 supergroup 63.83 MB 1/1/2016, 10:15:41 AM 1 256 MB 000003_0
-rwxr-xr-x user1 supergroup 69.11 MB 1/1/2016, 10:15:57 AM 1 256 MB 000004_0
-rwxr-xr-x user1 supergroup 23.83 MB 1/1/2016, 10:16:02 AM 1 256 MB 000005_0
To potentially remove this, the configurations below is used as suggested by multiple other stackoverflow users and TableORC is truncated and query (1) is executed again.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=128000000; -- (128MB)
SET hive.merge.smallfiles.avgsize=128000000; -- (128MB)
This time the result is a TableOrc having 3 ORC files and LOGS show 3 additional MR jobs
-- FolderList2
-rwxr-xr-x user1 supergroup 132.49 MB 1/1/2016, 11:47:01 PM 1 256 MB 000000_0
-rwxr-xr-x user1 supergroup 130.12 MB 1/1/2016, 11:47:02 PM 1 256 MB 000001_0
-rwxr-xr-x user1 supergroup 92.93 MB 1/1/2016, 11:47:03 PM 1 256 MB 000002_0
My question is why are the ORC file sizes greater than 128MB when the configuration settings are set to use 128MB (https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties)
hive.merge.size.per.task
Default Value: 256000000
Added In: Hive 0.4.0
Size of merged files at the end of the job.