I have loaded bunch of .gz file into HDFS and when I create a raw table on top of them I am seeing strange behavior when counting number of rows. Comparing the result of the count(*) from the gz table versus the uncompressed table results in ~85% difference. The table that has the file gz compressed has less records. Has anyone seen this?
CREATE EXTERNAL TABLE IF NOT EXISTS test_gz(
col1 string, col2 string, col3 string)
ROW FORMAT DELIMITED
LINES TERMINATED BY '\n'
LOCATION '/data/raw/test_gz'
;
select count(*) from test_gz; result 1,123,456
select count(*) from test; result 7,720,109