Multiple part files created while inserting into a Hive table

Question:

I have a Hive table (with compression) with a definition like:

create table temp1 (col1 string, col2 int)
partitioned by (col3 string, col4 string) 
row format delimited 
fields terminated by ',' 
escaped by '\\' 
lines terminated by '\n'
stored as sequencefile;

When I do a simple select and insert (no reducers running) from another Hive table into this table, I see a distinct pattern: the compressed data gets split across a large number of very small files (Table 1: at times 1 GB of data is spread over 200-300 files, consuming many more blocks than the ~16 it should span), and because of this a very large number of mappers are spawned when I query the new table. File size never exceeds 245 MB (Table 2). Since my block size is 64 MB, is there a setting to restrict output files to 64 MB (or a multiple of 64 MB, or just a single file) so that excess blocks are not created?
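For reference, the kind of statement that produces this pattern is a plain map-only insert along these lines (a sketch; the source table name and partition values are hypothetical, not from the original post):

-- hypothetical map-only insert: no join/group-by, so no reducers run
insert overwrite table temp1 partition (col3='a', col4='b')
select col1, col2 from some_source_table;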

TABLE 1

Name     | Type | Size    | Block Size
000000_0 | file | 30.22MB | 64 MB
000001_0 | file | 26.19MB | 64 MB
000002_0 | file | 25.19MB | 64 MB
000003_0 | file | 24.74MB | 64 MB
000004_0 | file | 24.54MB | 64 MB
...
000031_0 | file | 0.9MB   | 64 MB

TABLE 2

Name     | Type | Size     | Block Size
000000_0 | file | 245.02MB | 64 MB
000001_0 | file | 245.01MB | 64 MB
000002_0 | file | 244.53MB | 64 MB
000003_0 | file | 244.4MB  | 64 MB
000004_0 | file | 198.21MB | 64 MB

Answer 1:

I have solved the issue, thanks to Bryan, who highlighted the Hive variables that control the query output format. I tested setting the following Hive variables in a session:

set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;

Now, inside each partition, I am getting compressed files of ~256 MB. To set these variables permanently, create a .hiverc file containing the same statements in that user's home directory.
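For example (a sketch of the file described above; the Hive CLI executes it at the start of each session):

-- ~/.hiverc: run automatically when the Hive CLI starts
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;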

Hope this helps.



Answer 2:

The key part of the create table query that causes this is the partitioned by clause. It splits the data into separate files and directories for each combination of values of the partition columns. This allows fast querying on subsets of them, but can create a large number of mappers when querying the entire table.
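For example, a query that filters on the partition columns reads only the matching directories (partition values here are illustrative):

-- partition pruning: only the col3=a/col4=b directory is scanned
select count(*) from temp1 where col3 = 'a' and col4 = 'b';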

I suggest you either reconsider your use of partitioned by or deal with its consequences.

Does this use of a large number of blocks actually hurt you? As long as you are not creating millions of files/directories far below the block size, your NameNode should not have much trouble handling them.

As for the file sizes, take a look at this: https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration

Pay special attention to hive.merge.mapfiles, hive.merge.mapredfiles, hive.merge.smallfiles.avgsize, and hive.merge.size.per.task.
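A sketch of these settings with illustrative values (the 256000000 figure mirrors Answer 1; map-only inserts like the one in the question are covered by hive.merge.mapfiles):

set hive.merge.mapfiles=true;                 -- merge small files produced by map-only jobs
set hive.merge.mapredfiles=true;              -- merge small files produced by map-reduce jobs
set hive.merge.smallfiles.avgsize=256000000;  -- merge when the average output file is below this size
set hive.merge.size.per.task=256000000;       -- target size for each merged file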