Why does Hive generate a MapReduce job for select column from tablename, but not for select * from tablename?
Question:
Answer 1:
When a simple statement like select * from tablename is executed, Hive simply fetches the data from the files stored in HDFS and prints it as rows and columns. Conceptually, this is like running a command such as
hadoop fs -cat hdfs://schemaname/tablename.txt
hadoop fs -cat hdfs://schemaname/tablename.rc
hadoop fs -cat hdfs://schemaname/tablename.orc
Or in whichever format your table's file is stored.
If you select a specific column, add a where clause, or use an aggregate on the table, MapReduce comes into the picture, because Hive then has to process each row (project, filter, or aggregate it) instead of just streaming the files as they are.
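One way to check which path a given query takes is to look at its plan with EXPLAIN. This is only a minimal sketch: tablename and column_name are placeholder names, and the exact plan text varies with the Hive version and its settings.

```sql
-- Placeholder table and column names; substitute your own.

-- Plan for the plain select: typically shows only a Fetch stage.
EXPLAIN SELECT * FROM tablename;

-- Plan for a single-column select: in the Hive versions this answer
-- describes, the plan includes a Map Reduce stage; newer versions may
-- show only a Fetch stage here as well.
EXPLAIN SELECT column_name FROM tablename;
```

If the plan lists a Map Reduce stage, the query will launch a job. If it lists only a Fetch operator, the data is read directly.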
Answer 2:
Whenever you run a plain 'select *', Hive creates a fetch task rather than a MapReduce task. The fetch task just dumps the data as it is, without doing any processing on it. When you run a 'select column', on the other hand, a map job internally picks out that particular column and returns it as the output.
There is also a JIRA issue filed to make 'select column' queries run without MapReduce. Check the details here: https://issues.apache.org/jira/browse/HIVE-887
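As far as I know, later Hive releases control this through the hive.fetch.task.conversion property (documented values: none, minimal, more). A hedged sketch, assuming a Hive version that supports this property and using tablename and column_name as placeholder names:

```sql
-- Assumes a Hive version that supports hive.fetch.task.conversion.
-- 'more' allows simple selects, filters and limits to be served by a
-- fetch task instead of a MapReduce job.
SET hive.fetch.task.conversion=more;

-- Placeholder names; with the setting above this would typically be
-- answered by a fetch task without launching a job.
SELECT column_name FROM tablename LIMIT 10;
```

The default value differs between Hive versions, so it is worth checking the current value (run SET hive.fetch.task.conversion; with no value) before relying on either behavior.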