I am always confused about how many mappers and reducers get created for a particular task in Hive. For example: if the block size is 128 MB and there are 365 files, each mapping to a date in a year (file size = 1 MB each), and the table is partitioned on the date column, how many mappers and reducers will run while loading the data?
Mappers:
The number of mappers depends on several factors: how the data is distributed among nodes, the input format, the execution engine, and configuration parameters. See also: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works
MR uses CombineInputFormat, while Tez uses grouped splits.
Tez: split grouping is controlled by tez.grouping.min-size and tez.grouping.max-size.
MapReduce: split sizes are controlled by mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize.
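As a minimal sketch, the split and grouping sizes that bound mapper parallelism can be adjusted per session; the byte values below are illustrative examples, not recommendations:

```sql
-- Tez: mapper count is bounded by grouped split sizes
set tez.grouping.min-size=16777216;    -- 16 MB lower bound per grouped split
set tez.grouping.max-size=1073741824;  -- 1 GB upper bound per grouped split

-- MapReduce: split sizes used when combining small input files
set mapreduce.input.fileinputformat.split.minsize=16777216;   -- 16 MB
set mapreduce.input.fileinputformat.split.maxsize=1073741824; -- 1 GB
```

Larger grouped splits mean fewer mappers; smaller ones mean more.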
Also, mappers run on the data nodes where the data is located. That is why manually controlling the number of mappers is not easy, and combining input is not always possible.
Reducers: Controlling the number of reducers is much easier. The number of reducers is determined by the following settings:

- mapreduce.job.reduces - The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. With -1, Hive automatically figures out the number of reducers.
- hive.exec.reducers.bytes.per.reducer - The size of input handled by one reducer. The default in Hive 0.14.0 and earlier is 1 GB (256 MB in Hive 0.14.0 and later).
- hive.exec.reducers.max - Maximum number of reducers that will be used. If mapreduce.job.reduces is negative, Hive will use this as the maximum number of reducers when automatically determining the number of reducers.

So, if you want to increase reducer parallelism, increase hive.exec.reducers.max and decrease hive.exec.reducers.bytes.per.reducer.
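For example, to push Hive toward more reducers in a session (the values below are illustrative, not recommendations):

```sql
-- Let Hive determine the reducer count automatically
set mapreduce.job.reduces=-1;
-- Smaller input per reducer => more reducers
set hive.exec.reducers.bytes.per.reducer=67108864;  -- 64 MB
-- Raise the cap on the automatically chosen count
set hive.exec.reducers.max=200;

-- The chosen count is roughly:
--   min(hive.exec.reducers.max,
--       ceil(total_input_bytes / hive.exec.reducers.bytes.per.reducer))
```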