HIVE : Why does Hive generate mapreduce job on sel

Posted 2019-08-28 22:50

Question:

Why does Hive generate a MapReduce job for select column from tablename, but not for select * from tablename?

Answer 1:

When a simple statement like select * from tablename is executed, Hive simply fetches the data from the file stored in HDFS and prints it out as rows and columns. No processing is needed, so it is roughly equivalent to running something like

hadoop fs -cat hdfs://schemaname/tablename.txt
hadoop fs -cat hdfs://schemaname/tablename.rc
hadoop fs -cat hdfs://schemaname/tablename.orc

or the equivalent for whichever file format your table is stored in.

If you select specific columns, add a where clause, or use an aggregate on the table, Hive has to process every row (project the columns, evaluate the filter, or compute the aggregate), so MapReduce comes into the picture.
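
One way to check which path a given query takes is to look at its plan with EXPLAIN. The snippet below is only a sketch: tablename and some_column are placeholder names, and the exact plan text varies between Hive versions. In general, a plan that contains only a fetch stage means no job is launched, while a plan with a Map Reduce stage means a job will run.

```sql
-- Placeholder table and column names; substitute your own.

-- Typically shows only a Fetch Operator stage (no MapReduce job).
EXPLAIN SELECT * FROM tablename;

-- Depending on the Hive version and settings, this may show a Map Reduce stage.
EXPLAIN SELECT some_column FROM tablename;
```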



Answer 2:

Whenever you run a plain 'select *', Hive creates a fetch task instead of a MapReduce task. The fetch task just dumps the data as it is, without doing any processing on it. When you run a 'select column', on the other hand, a map job internally picks that particular column out of each row and returns it as the output.

There is also a JIRA issue filed to make a 'select column' query run without MapReduce. Check the details here: https://issues.apache.org/jira/browse/HIVE-887
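
Later Hive releases expose a setting, hive.fetch.task.conversion, that controls which simple queries are served by a fetch task instead of a MapReduce job. The following is a hedged sketch, not taken from the answers above: the table and column names are placeholders, and the default value and exact behaviour depend on your Hive version, so check the configuration documentation for your release.

```sql
-- Placeholder table and column names; substitute your own.

-- "more" lets simple SELECT, FILTER and LIMIT queries run as a fetch task.
-- Other documented values are "none" (always launch a job) and "minimal".
SET hive.fetch.task.conversion=more;

-- With the setting above, a simple projection like this can run without MapReduce.
SELECT some_column FROM tablename LIMIT 10;

-- Aggregates still need a job regardless of the setting.
SELECT count(*) FROM tablename;
```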



Tags: hive