
Question:

I am trying to understand how Hive and Hadoop interact. From the tutorials I have read, it appears that before running Hive queries you run a map/reduce job to get the input data. This seems counterproductive to me: if I have already run the map/reduce job and have the data in an easily parsable format, why would I not put the data into a traditional database?

Thanks for your help, Nathan

Answer 1:

Hive operates on files that are stored on HDFS. For anything other than the simplest queries, Hive generates and runs MapReduce jobs. For very simple queries (SELECT * FROM MyTable) it just streams the files off disk.
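
To make the "generates and runs MapReduce jobs" part concrete, here is a rough Python sketch of how an aggregate query such as SELECT country, COUNT(*) FROM MyTable GROUP BY country breaks down into map, shuffle, and reduce steps. This is only a conceptual, single-process illustration, not Hive's actual execution plan; the country column and the sample rows are made up for the example.

```python
from collections import defaultdict

def map_phase(rows):
    # Mapper: emit (group key, 1) for every input row.
    for row in rows:
        yield row["country"], 1

def shuffle(pairs):
    # Shuffle: collect all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the ones for each key, which gives COUNT(*).
    return {key: sum(values) for key, values in groups.items()}

def count_by_country(rows):
    return reduce_phase(shuffle(map_phase(rows)))

# Hypothetical table contents.
rows = [{"country": "US"}, {"country": "DE"}, {"country": "US"}]
print(count_by_country(rows))  # {'US': 2, 'DE': 1}
```

On a real cluster the map and reduce steps run in parallel on many nodes, which is why a plain SELECT * can skip all of this and simply read the files.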

The input data doesn't need to come from MapReduce; it can be a simple text file uploaded to HDFS. See http://developer.yahoo.com/hadoop/tutorial/module2.html#commandref
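
As an illustration of the text-file case, the sketch below builds a HiveQL CREATE EXTERNAL TABLE statement that points Hive at a directory of tab-delimited files already uploaded to HDFS (for example with hadoop fs -put). The helper external_table_ddl, the table name page_views, its columns, and the path /data/page_views are all hypothetical; the function only produces the statement text, which you would then run yourself in the Hive CLI.

```python
def external_table_ddl(table, columns, hdfs_dir, delimiter="\\t"):
    """Build a HiveQL statement for a table over delimited text files.

    columns is a list of (name, hive_type) pairs; hdfs_dir is the HDFS
    directory that holds the uploaded files (LOCATION takes a directory).
    """
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols}) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}' "
        f"STORED AS TEXTFILE "
        f"LOCATION '{hdfs_dir}'"
    )

# Hypothetical example: files were uploaded beforehand with something like
#   hadoop fs -put views.tsv /data/page_views/
ddl = external_table_ddl(
    "page_views",
    [("user_id", "STRING"), ("url", "STRING"), ("hits", "INT")],
    "/data/page_views",
)
print(ddl)
```

Because the table is EXTERNAL, Hive reads the files where they are and leaves them in place if the table is later dropped, so no MapReduce job is needed just to prepare the input.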

Answer 2:

Hive fills a very important gap in open source software by providing the functionality of a massively parallel processing database. In other words, it gives us a horizontally scalable analytical SQL engine.
Specifically to your question, I can see a few main scenarios where Hive is a better fit than an RDBMS:
a) The data is already in HDFS and has other uses there (such as MR jobs).
b) There is too much data to load into a single-server RDBMS.
c) The data needs to be queried only once or twice. In these cases Hive can outperform an RDBMS, because loading the data into an RDBMS is relatively slow.

Answer 3:

Yes. Hive is built on top of Hadoop, which provides distributed computation. Hive stores its data on HDFS: each table is stored as a directory of one or more files on HDFS.