My background - 4 weeks old in the Hadoop world. Dabbled a bit in Hive, Pig and Hadoop using Cloudera's Hadoop VM. Have read Google's paper on Map-Reduce and GFS (PDF link).
I understand that-
Pig's language Pig Latin is a shift from(suits the way programmers think) SQL like declarative style of programming and Hive's query language closely resembles SQL.
Pig sits on top of Hadoop and in principle can also sit on top of Dryad. I might be wrong but Hive is closely coupled to Hadoop.
Both Pig Latin and Hive commands compiles to Map and Reduce jobs.
My question - What is the goal of having both when one (say Pig) could serve the purpose. Is it just because Pig is evangelized by Yahoo! and Hive by Facebook?
Pig eats anything! Meaning it can consume unstructured data.
Hive requires a schema.
From the link: http://www.aptibook.com/discuss-technical?uid=tech-hive4&question=What-kind-of-datawarehouse-application-is-suitable-for-Hive?
Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do.
Hive is most suited for data warehouse applications, where
1) Relatively static data is analyzed,
2) Fast response times are not required, and
3) When the data is not changing rapidly.
Hive doesn’t provide crucial features required for OLTP, Online Transaction Processing. It’s closer to being an OLAP tool, Online Analytic Processing. So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.
Pig-latin is data flow style, is more suitable for software engineer. While sql is more suitable for analytics person who are get used to sql. For complex task, for hive you have to manually to create temporary table to store intermediate data, but it is not necessary for pig.
Pig-latin is suitable for complicated data structure( like small graph). There's a data structure in pig called DataBag which is a collection of Tuple. Sometimes you need to calculate metrics which involve multiple tuples ( there's a hidden link between tuples, in this case I would call it graph). In this case, it is very easy to write a UDF to calculate the metrics which involve multiple tuples. Of course it could be done in hive, but it is not so convenient as it is in pig.
Writing UDF in pig much is easier than in Hive in my opinion.
Pig has no metadata support, (or it is optional, in future it may integrate hcatalog). Hive has tables' metadata stored in database.
You can debug pig script in local environment, but it would be hard for hive to do that. The reason is point 3. You need to set up hive metadata in your local environment, very time consuming.
Here are some additional links on to use Pig or Hive.
http://aws.amazon.com/elasticmapreduce/faqs/#hive-8
http://www.larsgeorge.com/2009/10/hive-vs-pig.html
Hive Vs Pig-
Hive is as SQL interface which allows sql savvy users or Other tools like Tableu/Microstrategy/any other tool or language that has sql interface..
PIG is more like a ETL pipeline..with step by step commands like declaring variables, looping, iterating , conditional statements etc.
I prefer writing Pig scripts over hive QL when I want to write complex step by step logic. When I am comfortable writing a single sql for pulling the data i want i use Hive. for hive you will need to define table before querying(as you do in RDBMS)
The purpose of both are different but under the hood, both do the same, convert to map reduce programs.Also the Apache open source community is add more and more features to both there projects
When we are using Hadoop in the sense it means we are trying to huge data processing The end goal of the data processing would be to generate content/reports out of it.
So it internally consists of 2 prime activities 1) Loading Data Processing 2) Generate content and use it for the reporting /etc..
Loading /Data Procesing -> Pig would be helpful in it. This helps as an ETL (We can perform etl operations using pig scripts.) Once the result is processed we can use hive to generate the reports based on the processed result.
Hive:Its built on top of hdfs for warehouse processing. WE can geenerate adhoc reports easily using hive from the processed content generated from pig.