I have a Hive table (with compression) with a definition like:
create table temp1 (col1 string, col2 int)
partitioned by (col3 string, col4 string)
row format delimited
fields terminated by ','
escaped by '\\'
lines terminated by '\n'
stored as sequencefile;
When I do a simple select and insert (no reducers running) from another Hive table into this table, I see a consistent pattern: the compressed data gets split across a large number of very small files (Table 1). At times 1 GB of data gets spread over 200-300 files, increasing the number of blocks consumed, even though it should have spanned only 16 blocks. Because of this, a very high number of mappers are spawned when I query the new table. File size never goes beyond 245 MB (Table 2). Is there a setting to restrict output files to 64 MB (or a multiple of 64 MB, or even a single file)? My block size is 64 MB, so excess blocks would not get created.
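For reference, the insert is along these lines (the source table name and the exact compression/dynamic-partition settings shown here are assumptions, since the original statement is not included in the question):

set hive.exec.compress.output=true;              -- assumed; the question only says compression is enabled
set hive.exec.dynamic.partition=true;            -- needed for dynamic partition inserts
set hive.exec.dynamic.partition.mode=nonstrict;  -- allow all partition columns to be dynamic
insert overwrite table temp1 partition (col3, col4)
select col1, col2, col3, col4
from source_table;                               -- source_table is a placeholder name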
TABLE 1
Name | Type | Size | Block Size
000000_0 | file | 30.22MB | 64 MB
000001_0 | file | 26.19MB | 64 MB
000002_0 | file | 25.19MB | 64 MB
000003_0 | file | 24.74MB | 64 MB
000004_0 | file | 24.54MB | 64 MB
..........
000031_0 | file | 0.9MB | 64 MB
TABLE 2
Name | Type | Size | Block Size
000000_0 | file | 245.02MB | 64 MB
000001_0 | file | 245.01MB | 64 MB
000002_0 | file | 244.53MB | 64 MB
000003_0 | file | 244.4MB | 64 MB
000004_0 | file | 198.21MB | 64 MB
The key part of the create table query that causes this is the partitioned by clause. This splits the data into separate files and directories for each combination of the partition columns. That allows for quick queries on subsets of those columns, but can create a large number of mappers when querying the entire table. I suggest you either reconsider your usage of partitioned by or deal with the consequences of it.

Does this usage of a large number of blocks actually hurt you that much? As long as you are not creating millions of files/directories way below the block size, your NameNode shouldn't have many issues handling them.
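For example, a query that filters on the partition columns only reads the matching directories (the literal values here are hypothetical):

-- Only scans the directory for col3='2014-01-01' and col4='US',
-- so few mappers are launched; a query over the whole table reads every partition.
select col1, col2
from temp1
where col3 = '2014-01-01' and col4 = 'US';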
As for the file sizes, take a look at this: https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration
Pay special attention to:

hive.merge.mapfiles
hive.merge.mapredfiles
hive.merge.smallfiles.avgsize
hive.merge.size.per.task
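In the Hive CLI you can inspect the current value of any of these before changing it; issuing set with just the property name prints its value in your session:

set hive.merge.mapredfiles;           -- prints the current value of the property
set hive.merge.smallfiles.avgsize;    -- if this is below your average output file size, no merge job runs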
I have solved the issue, thanks to Bryan's answer, which highlighted the Hive variables that control the query output format. I tested setting the following Hive variables in a session:
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;
So now inside a partition I am getting compressed files of ~256 MB each. To set these variables permanently, create a .hiverc file containing the same statements in that user's home directory.
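A minimal .hiverc would then look like this (assuming the default Hive CLI behavior of reading it from the user's home directory):

-- contents of ~/.hiverc; executed at the start of every Hive CLI session
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;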
Hope this helps.