Hive gzip file decompression

Posted 2019-05-07 06:28

Question:

I have loaded a bunch of .gz files into HDFS, and when I create a raw table on top of them I see strange behavior when counting the number of rows. Comparing the result of count(*) on the gzipped table against the uncompressed table shows a difference of about 85%: the table backed by the gzip-compressed files has far fewer records. Has anyone seen this?

CREATE EXTERNAL TABLE IF NOT EXISTS test_gz (
  col1 string, col2 string, col3 string)
ROW FORMAT DELIMITED
  LINES TERMINATED BY '\n'
LOCATION '/data/raw/test_gz';

select count(*) from test_gz;   -- result: 1,123,456
select count(*) from test;      -- result: 7,720,109
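
To rule out Hive itself, the raw files can be counted outside MapReduce. A minimal sanity check, assuming the standard gzip and HDFS command-line tools (the part-file name below is hypothetical; substitute an actual file from /data/raw/test_gz):

# decompress one gzip part file locally and count its lines
hadoop fs -cat /data/raw/test_gz/part-0000.gz | gzip -dc | wc -l
# test the archive's integrity end to end
hadoop fs -cat /data/raw/test_gz/part-0000.gz | gzip -t

If the local line count matches the expected total while Hive reports far fewer rows, the file itself is intact and the problem lies in how the job decompresses it.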

Answer 1:

I was able to resolve this issue. Somehow the gzip files were not being fully decompressed in map/reduce jobs (Hive or custom Java MapReduce). The MapReduce job would read only about 450 MB of the 3.5 GB gzip file and write that data out to HDFS without reading the rest of the file. Strangely, there were no errors at all!

Since the files had been compressed on another server, I decompressed them manually and re-compressed them on the Hadoop client server. After that, I uploaded the newly compressed 3.5 GB file to HDFS, and Hive was then able to count all the records by reading the whole file.
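
For reference, the workaround amounts to something like the following sketch (the file name is hypothetical, not the exact commands used). One known cause of this symptom is a gzip file built by concatenating several gzip members (e.g. cat a.gz b.gz > c.gz): older Hadoop gzip codecs stop silently after the first member, and re-compressing the data as a single stream avoids that.

hadoop fs -get /data/raw/test_gz/data.gz .
gunzip data.gz                  # decompress locally on the Hadoop client
gzip data                       # re-compress as a single gzip stream
hadoop fs -put -f data.gz /data/raw/test_gz/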

Marcin



Tags: hadoop gzip hive