How to determine file size in HDFS using Hive

Published 2019-07-13 15:56

The environment I am using runs Hive 1.1.0 on CDH 5.5.4. I run a query that produces a result with 22 partitions. Each partition directory contains a single file, whose size varies from 20 MB to 700 MB.

From what I understand, this is related to the number of reducers used while processing the query. Let's assume I want to have 5 files for each partition instead of 1; I use this command:

set mapreduce.job.reduces=5;

This makes the system use 5 reduce tasks in stage 1, but it automatically switches to 1 reducer in stage 2 (determined automatically at compile time). From what I have read, this is because the compiler takes precedence over the configuration when choosing the number of reducers. It seems that some tasks cannot be parallelized and can only be done by a single process or reducer task, so the system determines this automatically.
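One way to see where the single-reducer stage comes from is to look at the compiled plan. A minimal sketch follows (the select list is simplified here for illustration; the full statement is the one shown under Code below, and the exact plan text varies by Hive version):

set mapreduce.job.reduces=5;

-- explain shows the stage breakdown; the per-stage reducer counts also appear
-- in the job launch logs ("Number of reduce tasks ...") when the query runs
explain
insert into table core.pae_ind1 partition (project,ut,year,month)
select ts, date_time, project, ut, year, month   -- simplified select list
from core.pae_open_close
where ut='902'
order by ut, ts;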

Code:

insert into table core.pae_ind1 partition (project,ut,year,month)
select
  ts,
  date_time,
  -- m1
  if( code_ac_dcu_m1_d1=0
      and (min(case when code_ac_dcu_m1_d1=1 then ts end)
             over (partition by ut order by ts
                   rows between 1 following and 1000 following) - ts) <= 15,
      min(case when code_ac_dcu_m1_d1=1 then ts end)
        over (partition by ut order by ts
              rows between 1 following and 1000 following) - ts,
      NULL) as t_open_dcu_m1_d1,

  if( code_ac_dcu_m1_d1=2
      and (min(case when code_ac_dcu_m1_d1=3 then ts end)
             over (partition by ut order by ts
                   rows between 1 following and 1000 following) - ts) <= 15,
      min(case when code_ac_dcu_m1_d1=3 then ts end)
        over (partition by ut order by ts
              rows between 1 following and 1000 following) - ts,
      NULL) as t_close_dcu_m1_d1,

  project, ut, year, month
from core.pae_open_close
where ut='902'
order by ut, ts

This leads to huge files at the end. I would like to know if there is a way to split these result files into smaller ones (preferably limiting them by size).

Tags: hadoop hive hdfs
1 Answer
放我归山
Answered 2019-07-13 16:32

As @DuduMarkovitz pointed out, your code contains an instruction to order the dataset globally. This will run on a single reducer. It is better to order when selecting from your table. Even if the files are in order after such an insert, and they are splittable, they will be read by many mappers, so the result will not be in order anyway due to parallelism and you will have to order again at read time. Just get rid of the order by ut,ts in the insert and use these configuration settings to control the number of reducers:

set hive.exec.reducers.bytes.per.reducer=67108864;  
set hive.exec.reducers.max = 2000; --default 1009 
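As for the ordering itself: if a consumer really needs sorted output, sort at read time instead of at write time. A minimal sketch (only the ut filter is taken from the question; add partition filters as needed):

select *
from core.pae_ind1
where ut = '902'
order by ut, ts;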

The number of reducers is determined according to these settings (a worked example follows below):

mapred.reduce.tasks - The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out the number of reducers.

hive.exec.reducers.bytes.per.reducer - The input size per reducer. The default is 1 GB prior to Hive 0.14.0 and 256 MB in Hive 0.14.0 and later.

hive.exec.reducers.max - The maximum number of reducers that will be used. If mapred.reduce.tasks is negative, Hive will use this as the maximum number of reducers when automatically determining the number of reducers.
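Putting these together, a rough worked example, assuming mapred.reduce.tasks is left at -1 and the reduce stage reads about 1 GB of input (the input size is an assumption for illustration):

-- reducers = min( ceil(input_bytes / hive.exec.reducers.bytes.per.reducer),
--                 hive.exec.reducers.max )
--          = min( ceil(1073741824 / 67108864), 2000 )
--          = min( 16, 2000 ) = 16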

So, if you want to increase reducer parallelism, increase hive.exec.reducers.max and decrease hive.exec.reducers.bytes.per.reducer. Each reducer will create one file per partition it receives data for (no bigger than hive.exec.reducers.bytes.per.reducer). It is possible that one reducer will receive data for many partitions and, as a result, will create a small file in each of them; since each partition's data is spread across many reducers during the shuffle phase, you can end up with many small files per partition.

If you do not want each reducer to create files for every partition (or for too many of them), then distribute by the partition key (instead of ordering). In this case the number of files per partition will be closer to partition_size / hive.exec.reducers.bytes.per.reducer, as sketched below.
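A sketch of the insert rewritten along these lines, with the global order by replaced by distribute by on the partition columns (the window expressions are copied unchanged from the question; the reducer settings are the ones suggested above):

set hive.exec.reducers.bytes.per.reducer=67108864;
set hive.exec.reducers.max=2000;

insert into table core.pae_ind1 partition (project,ut,year,month)
select
  ts,
  date_time,
  -- m1
  if( code_ac_dcu_m1_d1=0
      and (min(case when code_ac_dcu_m1_d1=1 then ts end)
             over (partition by ut order by ts
                   rows between 1 following and 1000 following) - ts) <= 15,
      min(case when code_ac_dcu_m1_d1=1 then ts end)
        over (partition by ut order by ts
              rows between 1 following and 1000 following) - ts,
      NULL) as t_open_dcu_m1_d1,
  if( code_ac_dcu_m1_d1=2
      and (min(case when code_ac_dcu_m1_d1=3 then ts end)
             over (partition by ut order by ts
                   rows between 1 following and 1000 following) - ts) <= 15,
      min(case when code_ac_dcu_m1_d1=3 then ts end)
        over (partition by ut order by ts
              rows between 1 following and 1000 following) - ts,
      NULL) as t_close_dcu_m1_d1,
  project, ut, year, month
from core.pae_open_close
where ut='902'
distribute by project, ut, year, month;   -- instead of order by ut, ts

Note that since ut is fixed to '902' by the where clause, the distribution here effectively happens on project, year and month.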
