How are Hive SQL queries submitted as MR jobs?

Published 2019-02-27 02:16

Question:

I have deployed a CDH 5.9 cluster with MR as the Hive execution engine. I have a Hive table named "users" with 50 rows. Whenever I execute the query select * from users, it works fine, as follows:

hive> select * from users;
OK

Adam       1       38     ATK093   CHEF
Benjamin   2       24     ATK032   SERVANT
Charles    3       45     ATK107   CASHIER
Ivy        4       30     ATK384   SERVANT
Linda      5       23     ATK132   ASSISTANT
...

Time taken: 0.059 seconds, Fetched: 50 row(s)

But issuing select max(age) from users fails after being submitted as an MR job. The container log also doesn't contain any information that would explain why it is failing.

    hive> select max(age) from users;
    Query ID = canballuser_20170808020101_5ed7c6b7-097f-4f5f-af68-486b45d7d4e
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1501851520242_0010, Tracking URL = http://hadoop-master:8088/proxy/application_1501851520242_0010/
    Kill Command = /opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/hadoop/bin/hadoop job  -kill job_1501851520242_0010
    Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
    2017-08-08 02:01:11,472 Stage-1 map = 0%,  reduce = 0%
    Ended Job = job_1501851520242_0010 with errors
    Error during job, obtaining debugging information...
    FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
    MapReduce Jobs Launched:
    Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 FAIL
    Total MapReduce CPU Time Spent: 0 msec

If I could get the workflow of Hive query execution from the Hive CLI, it might help me debug the issue further.

Answer 1:

A lot of components are involved in the Hive query execution flow. The high-level architecture is explained here: https://cwiki.apache.org/confluence/display/Hive/Design

There are links in this document to more detailed component documents.

Typical query execution flow (High Level)

  1. The UI calls the execute interface to the Driver.
  2. The Driver creates a session handle for the query and sends the query to the compiler to generate an execution plan.
  3. The compiler gets the necessary metadata from the metastore. This metadata is used to typecheck the expressions in the query tree as well as to prune partitions based on query predicates.
  4. The plan generated by the compiler is a DAG of stages with each stage being either a map/reduce job, a metadata operation or an operation on HDFS. For map/reduce stages, the plan contains map operator trees (operator trees that are executed on the mappers) and a reduce operator tree (for operations that need reducers).
  5. The execution engine submits these stages to the appropriate components. In each task (mapper/reducer), the deserializer associated with the table or intermediate outputs is used to read the rows from HDFS files, and these are passed through the associated operator tree. Once the output is generated, it is written to a temporary HDFS file through the serializer (this happens in the mapper in case the operation does not need a reduce). The temporary files are used to provide data to subsequent map/reduce stages of the plan. For DML operations, the final temporary file is moved to the table's location. This scheme is used to ensure that dirty data is not read (file rename being an atomic operation in HDFS).
  6. For queries, the contents of the temporary file are read by the execution engine directly from HDFS as part of the fetch call from the Driver.
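The stage DAG described in step 4 can be inspected directly from the Hive CLI with EXPLAIN, which is often the quickest way to see what Hive intends to submit before any job runs. A sketch against the table from the question (exact output shape varies by Hive version):

```shell
# Show the compiled plan without running the job: the output lists the
# stage DAG (e.g. Stage-1 as a map/reduce stage, Stage-0 as the fetch
# stage) plus the map and reduce operator trees for each stage.
hive -e "EXPLAIN SELECT max(age) FROM users;"

# EXPLAIN EXTENDED additionally prints serde, input path and other
# metadata pulled from the metastore during compilation.
hive -e "EXPLAIN EXTENDED SELECT max(age) FROM users;"
```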

The Hive documentation root is here: https://cwiki.apache.org/confluence/display/Hive/Home You can find more details about the different components there, and you can also study the sources for details of specific class implementations.

Hadoop JobTracker docs: https://wiki.apache.org/hadoop/JobTracker
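For the specific failure in the question, the usual first step is pulling the aggregated YARN container logs for the application id that Hive printed, and rerunning the query with console debug logging. A sketch, assuming YARN log aggregation is enabled on the cluster (the application id is the one from the question's output):

```shell
# Pull all container logs for the failed application; the stderr/syslog
# of the failed task attempt usually contains the actual stack trace.
yarn logs -applicationId application_1501851520242_0010

# Rerun the query with Hive's root logger at DEBUG on the console to
# watch the plan compilation and job submission steps as they happen.
hive --hiveconf hive.root.logger=DEBUG,console -e "SELECT max(age) FROM users;"
```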