Difference between Pig and Hive? Why have both?

Posted 2019-01-09 20:50

My background - 4 weeks old in the Hadoop world. Dabbled a bit in Hive, Pig and Hadoop using Cloudera's Hadoop VM. Have read Google's papers on MapReduce and GFS.

I understand that:

  • Pig's language, Pig Latin, is a shift away from the SQL-like declarative style of programming (it suits the way programmers think), while Hive's query language closely resembles SQL.

  • Pig sits on top of Hadoop and in principle can also sit on top of Dryad. I might be wrong, but Hive is closely coupled to Hadoop.

  • Both Pig Latin and Hive commands compile to MapReduce jobs (a rough sketch of what such a job looks like follows below).
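To make that last point concrete, here is a minimal, Hadoop Streaming-style sketch in Python of what a map step and a reduce step look like. It is not output of either tool; the word-count task and the local driver are assumptions purely to illustrate the general shape of job both Pig Latin and HiveQL compile down to.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming-style sketch: a map step and a reduce step.
# Pig Latin and HiveQL statements are ultimately compiled into jobs of this
# general shape (map -> shuffle/sort -> reduce). Word count is an assumed task.
import sys
from itertools import groupby


def mapper(lines):
    """Emit (word, 1) pairs, one per token."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reducer(pairs):
    """Sum counts per word; pairs must arrive grouped by key (the shuffle phase)."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    # Local stand-in for the framework: map, sort by key, then reduce.
    mapped = sorted(mapper(sys.stdin), key=lambda kv: kv[0])
    for word, total in reducer(mapped):
        print(word + "\t" + str(total))
```

Run it locally with something like cat some_text_file | python wordcount.py; on a cluster the framework supplies the sorting and distribution between the two steps.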

My question - What is the goal of having both when one (say Pig) could serve the purpose? Is it just because Pig is evangelized by Yahoo! and Hive by Facebook?

19 Answers
闹够了就滚
#2 · 2019-01-09 21:09

Pig allows one to load data and user code at any point in the pipeline. This can be particularly important if the data is streaming data, for example data from satellites or instruments.

Hive, which is RDBMS based, needs the data to be imported (or loaded) first, and after that it can be worked upon. So if you were using Hive on streaming data, you would have to keep filling buckets (or files) and use Hive on each filled bucket, while using other buckets to keep storing the newly arriving data.

Pig also uses lazy evaluation. It allows greater ease of programming, and one can use it to analyze data in different ways with more freedom than in an SQL-like language such as Hive. So if you really wanted to analyze matrices or patterns in some unstructured data you had, and wanted to do interesting calculations on them, with Pig you can go a fair distance, while with Hive you need something else to play with the results.
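As a loose analogy only (plain Python generators, not Pig itself), lazy evaluation means each pipeline step merely describes work, and nothing executes until a result is actually demanded, roughly as Pig defers execution until a DUMP or STORE. The sample rows and steps below are invented for illustration:

```python
# A loose Python-generator analogy for a lazily evaluated dataflow pipeline.
# The sample rows and the three steps are invented; the point is that no work
# happens until the final consumer asks for results.

def load(lines):
    for line in lines:                      # defines a step; nothing runs yet
        yield line.split(",")

def filter_valid(rows):
    for row in rows:
        if len(row) == 3 and row[2].isdigit():
            yield row

def project(rows):
    for name, _, score in rows:
        yield name, int(score)

raw = ["alice,x,10", "bob,y,notanumber", "carol,z,7"]
pipeline = project(filter_valid(load(raw)))  # still nothing has executed

print(list(pipeline))  # evaluation happens only here: [('alice', 10), ('carol', 7)]
```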

Pig is faster in the data import but slower in actual execution than an RDBMS friendly language like Hive.

Pig is well suited to parallelization and so it possibly has an edge for systems where the datasets are huge, i.e. in systems where you are concerned more about the throughput of your results than the latency (the time to get any particular datum of result).

Ridiculous、
#3 · 2019-01-09 21:12

I believe that the real answer to your question is that they are/were independent projects and there was no centrally coordinated goal. They were in different spaces early on and have grown to overlap with time as both projects expanded.

Paraphrased from the Hadoop O'Reilly book:

Pig: a dataflow language and environment for exploring very large datasets.

Hive: a distributed data warehouse.

贪生不怕死
#4 · 2019-01-09 21:13

Hive was designed to appeal to a community comfortable with SQL. Its philosophy was that we don't need yet another scripting language. Hive supports map and reduce transform scripts in the language of the user's choice (which can be embedded within SQL clauses). It is widely used at Facebook by analysts comfortable with SQL as well as by data miners programming in Python. SQL compatibility efforts in Pig have been abandoned AFAIK, so the difference between the two projects is very clear.
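Those transform scripts are plain stdin-to-stdout programs. A minimal Python sketch of one follows; the (user_id, raw_timestamp, url) column layout is an assumption for illustration. Hive streams rows to the script as tab-separated text and reads tab-separated rows back:

```python
#!/usr/bin/env python
# Minimal transform script of the kind Hive can invoke from a TRANSFORM clause.
# Hive pipes rows in as tab-separated text on stdin and reads rows back from
# stdout. The (user_id, raw_timestamp, url) layout is assumed for illustration.
import sys
from datetime import datetime

for line in sys.stdin:
    user_id, raw_ts, url = line.rstrip("\n").split("\t")
    # Derive a calendar day from the raw epoch timestamp.
    day = datetime.fromtimestamp(int(raw_ts)).strftime("%Y-%m-%d")
    print("\t".join([user_id, day, url]))
```

A query would reference it through Hive's SELECT TRANSFORM(...) USING '...' AS (...) syntax, and the same script can be tested locally by piping a few tab-separated lines through it.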

Supporting SQL syntax also means that it's possible to integrate with existing BI tools like Microstrategy. Hive has an ODBC/JDBC driver (that's a work in progress) that should allow this to happen in the near future. It's also beginning to add support for indexes which should allow support for drill-down queries common in such environments.
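As an illustration of the kind of programmatic SQL access those driver efforts are aimed at, here is a sketch using the third-party pyhive client, which talks to a HiveServer Thrift endpoint; the host name, credentials and the page_views table are assumptions:

```python
# Sketch of querying Hive programmatically with the third-party pyhive client.
# The host, port, username and the page_views table are made up for illustration.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000, username="analyst")
cursor = conn.cursor()
cursor.execute("SELECT url, COUNT(1) AS hits FROM page_views GROUP BY url")
for url, hits in cursor.fetchall():
    print(url, hits)
cursor.close()
conn.close()
```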

Finally (this is not directly pertinent to the question), Hive is a framework for performing analytic queries. While its dominant use is to query flat files, there's no reason why it cannot query other stores. Currently Hive can be used to query data stored in HBase (which is a key-value store like those found in the guts of most RDBMSes), and the HadoopDB project has used Hive to query a federated RDBMS tier.

家丑人穷心不美
#5 · 2019-01-09 21:13

To give a very high-level overview of both, in short:

1) Pig is a relational algebra over Hadoop

2) Hive is SQL over Hadoop (one level above Pig)

\"骚年 ilove
6楼-- · 2019-01-09 21:15

I found this the most helpful (though it's a year old): http://yahoohadoop.tumblr.com/post/98256601751/pig-and-hive-at-yahoo

It specifically talks about Pig vs Hive and when and where they are employed at Yahoo. I found this very insightful. Some interesting notes:

On incremental changes/updates to data sets:

Instead, joining against the new incremental data and using the results together with the results from the previous full join is the correct approach. This will take only a few minutes. Standard database operations can be implemented in this incremental way in Pig Latin, making Pig a good tool for this use case.
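A conceptual Python sketch of that incremental pattern (all of the data sets and keys below are made up): join only the newly arrived rows, then merge with the previously computed full result.

```python
# Conceptual sketch of the incremental-join pattern described in the quote above.
# The in-memory data sets are invented stand-ins for illustration only.

previous_full_join = [                      # result kept from the last run
    ("u1", "alice", "2019-01-01"),
    ("u2", "bob", "2019-01-02"),
]
users = {"u1": "alice", "u2": "bob", "u3": "carol"}        # reference input
new_events = [("u3", "2019-01-09"), ("u1", "2019-01-09")]  # only the new rows

# Join just the increment instead of re-joining the entire history ...
incremental_join = [
    (user_id, users[user_id], day)
    for user_id, day in new_events
    if user_id in users
]

# ... then combine it with the previous full result.
updated_full_join = previous_full_join + incremental_join
for row in updated_full_join:
    print(row)
```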

On using other tools via streaming:

Pig integration with streaming also makes it easy for researchers to take a Perl or Python script they have already debugged on a small data set and run it against a huge data set.
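The appeal is that such a script is just a stdin-to-stdout filter, so the version debugged locally on a small sample runs unchanged when Pig invokes it over the full data set through its STREAM operator. A tiny Python sketch, with an invented session-length rule:

```python
#!/usr/bin/env python
# Stdin-to-stdout filter of the kind Pig can run through its STREAM operator.
# Debug locally with: cat small_sample.tsv | python keep_long_sessions.py
# The (session_id, duration_seconds) layout and the 60-second threshold are
# invented for illustration.
import sys

for line in sys.stdin:
    session_id, duration = line.rstrip("\n").split("\t")
    if float(duration) >= 60.0:    # keep only sessions at least a minute long
        print(session_id + "\t" + duration)
```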

On using Hive for data warehousing:

In both cases, the relational model and SQL are the best fit. Indeed, data warehousing has been one of the core use cases for SQL through much of its history. It has the right constructs to support the types of queries and tools that analysts want to use. And it is already in use by both the tools and users in the field.

The Hadoop subproject Hive provides a SQL interface and relational model for Hadoop. The Hive team has begun work to integrate with BI tools via interfaces such as ODBC.

做自己的国王
#7 · 2019-01-09 21:15

In simpler words, Pig is a high-level platform for creating MapReduce programs used with Hadoop; using Pig scripts, we process large amounts of data into the desired format.

Once the processed data is obtained, it is kept in HDFS for later processing to obtain the desired results.

On top of the stored processed data we apply Hive SQL commands to get the desired results; internally, these Hive SQL commands run MapReduce programs.
