Pig vs Hive vs Native Map Reduce

Posted 2019-01-22 03:10

I have a basic understanding of what the Pig and Hive abstractions are, but I don't have a clear idea of the scenarios that require Hive, Pig, or native MapReduce.

I went through a few articles, which basically point out that Hive is for structured processing and Pig is for unstructured processing. When do we need native MapReduce? Can you point out a few scenarios that can't be solved using Pig or Hive but can be solved in native MapReduce?

7 answers
Ridiculous、
Reply #2 · 2019-01-22 03:29

Everything we can do using Pig and Hive can also be achieved using MR, though it will sometimes be more time-consuming. Pig and Hive use MR, Spark, or Tez underneath. The reverse does not hold: not everything MR can do is necessarily possible in Hive and Pig.
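To make the "MR underneath" point concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce phases that a grouped count (roughly what a Hive `SELECT page, COUNT(*) ... GROUP BY page` asks for) boils down to when the engine is MapReduce. It is plain Python rather than the Hadoop API, and `run_mapreduce`, `page_mapper`, `count_reducer`, and the sample log lines are all made up for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one in-memory map -> shuffle -> reduce pass (illustration only)."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record may emit any number of (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: values are grouped by key.
            groups[key].append(value)
    # Reduce phase: one reducer call per key, with keys visited in sorted order.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


def page_mapper(line):
    # Hypothetical input format: "user,page".
    _user, page = line.split(",")
    yield page, 1


def count_reducer(_key, values):
    return sum(values)


logs = ["alice,/home", "bob,/home", "alice,/cart"]
page_counts = run_mapreduce(logs, page_mapper, count_reducer)
print(page_counts)
```

Writing the mapper and reducer by hand is where native MR gives you extra control, and also where the extra development time goes.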

祖国的老花朵
Reply #3 · 2019-01-22 03:32

Hive

Pros:

     SQL-like, which database people love.
     Good support for structured data.
     Supports database schemas and view-like structures.
     Supports concurrent multi-user, multi-session scenarios.
     Bigger community support: Hive, HiveServer, HiveServer2, Impala, and Sentry already exist.

Cons:

     Performance degrades as the data grows, with memory overflow issues, and there is not much you can do about it.
     Hierarchical data is a challenge.
     Unstructured data requires UDF-like components.
     Combining multiple techniques can be a nightmare, for example dynamic partitions with UDTFs on big data.

Pig

Pros:

     Great script-based data flow language.

Cons:

     Unstructured data requires UDF-like components.
     Not much community support.

MapReduce

Pros:

     I don't agree with "hard to achieve join functionality": if you understand what kind of join you want, you can implement it in a few lines of code.
     Most of the time MR yields better performance.
     MR's support for hierarchical data is great, especially for implementing tree-like structures.
     Better control over partitioning and indexing the data.
     Job chaining.

Cons:

     You need to know the API very well to get good performance.
     More code to write, debug, and maintain.
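As a rough illustration of the join point, here is a sketch of a reduce-side inner join: the map step tags each record with its source, the shuffle groups records by join key, and the reducer pairs up the two sides. This is plain in-memory Python, not the Hadoop API, and the `users`/`orders` datasets and all function names are hypothetical.

```python
from collections import defaultdict


def tag_mapper(source, record):
    """Map step: emit (join key, (source tag, payload))."""
    key, payload = record
    return key, (source, payload)


def join_reducer(key, tagged_values):
    """Reduce step: pair every 'users' payload with every 'orders' payload."""
    left = [payload for tag, payload in tagged_values if tag == "users"]
    right = [payload for tag, payload in tagged_values if tag == "orders"]
    # Inner join: keys missing from either side produce no output.
    return [(key, l, r) for l in left for r in right]


def reduce_side_join(users, orders):
    groups = defaultdict(list)
    for source, dataset in (("users", users), ("orders", orders)):
        for record in dataset:
            key, tagged = tag_mapper(source, record)
            groups[key].append(tagged)  # stands in for the shuffle
    joined_rows = []
    for key in sorted(groups):
        joined_rows.extend(join_reducer(key, groups[key]))
    return joined_rows


users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (1, "pen"), (3, "lamp"), (4, "desk")]
joined = reduce_side_join(users, orders)
print(joined)
```

Other join types (left outer, map-side/replicated) change only a few lines of the reducer or move the small table into the mapper, which is the kind of control the answer is referring to.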

可以哭但决不认输i
Reply #4 · 2019-01-22 03:41

Short answer: we need MapReduce when we need deep, fine-grained control over the way we process our data. Sometimes it is not convenient to express exactly what we need in terms of Pig and Hive queries.

It should not be totally impossible to do through Pig or Hive what you can do using MapReduce. With the flexibility Pig and Hive provide you can usually manage to reach your goal, for example by writing UDFs, but it might not be that smooth.

There is no clear-cut distinction in how these tools are used. It depends entirely on your particular use case: based on your data and the kind of processing required, you need to decide which tool fits your requirements better.

Edit :

Some time ago I had a use case where I had to collect seismic data and run some analytics on it. The format of the files holding this data was somewhat weird: part of the data was EBCDIC encoded, while the rest was in binary format. It was basically a flat binary file with no delimiters such as \n. I had a tough time finding a way to process these files using Pig or Hive, so I had to settle on MR. Initially it took time, but it gradually became smoother, as MR is really swift once you have the basic template ready.
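To give a feel for what parsing such a file involves, here is a small sketch that reads fixed-length records containing an EBCDIC text field followed by binary integers. The record layout (an 8-byte station id plus three big-endian 32-bit samples) is invented for the example and is not the actual seismic format; in a real MR job this logic would typically sit in a custom InputFormat/RecordReader. The `cp037` codec is one of Python's built-in EBCDIC code pages.

```python
import struct

# Hypothetical layout, for illustration only.
HEADER_BYTES = 8    # EBCDIC-encoded station id, space padded
SAMPLE_COUNT = 3    # signed 32-bit samples per record
RECORD_FORMAT = f">{HEADER_BYTES}s{SAMPLE_COUNT}i"  # ">" = big-endian, no padding
RECORD_BYTES = struct.calcsize(RECORD_FORMAT)


def read_records(blob):
    """Yield (station_id, samples) from a delimiter-free binary blob."""
    if len(blob) % RECORD_BYTES:
        raise ValueError("blob length is not a multiple of the record size")
    for offset in range(0, len(blob), RECORD_BYTES):
        raw_id, *samples = struct.unpack_from(RECORD_FORMAT, blob, offset)
        # The text part is EBCDIC, so it must be decoded explicitly.
        yield raw_id.decode("cp037").rstrip(), samples


def make_record(station_id, samples):
    """Build one record in the hypothetical layout (used to create test data)."""
    encoded_id = station_id.ljust(HEADER_BYTES).encode("cp037")
    return struct.pack(RECORD_FORMAT, encoded_id, *samples)


blob = make_record("STN001", [10, -20, 30]) + make_record("STN002", [1, 2, 3])
records = list(read_records(blob))
print(records)
```

Because records are found by byte offset rather than by a line delimiter, line-oriented loaders do not fit this kind of data without custom code.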

So, like I said earlier, it basically depends on your use case. For example, iterating over each record of your dataset is really easy in Pig (just a foreach), but what if you need foreach n? When you need "that" level of control over the way you process your data, MR is more suitable.

Another situation might be when your data is hierarchical rather than row-based, or when it is highly unstructured.

Metapattern problems involving job chaining and job merging are easier to solve using MR directly than using Pig/Hive.

And sometimes it is much more convenient to accomplish a particular task using some other tool than using Pig/Hive. IMHO, MR turns out to be better in such situations as well. For example, if you need to do some statistical analysis on your big data, R used with Hadoop Streaming is probably the best option.
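For readers who have not used Hadoop Streaming: the mapper and reducer are ordinary programs that read lines and write tab-separated key/value lines, and the reducer receives its input sorted by key. The sketch below shows that contract with a per-sensor mean. It uses Python rather than R purely to keep the example self-contained, and the `sensor,reading` input format and function names are assumptions made for illustration.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: turn 'sensor,reading' lines into 'sensor<TAB>reading' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        sensor, reading = line.split(",")
        yield f"{sensor}\t{float(reading)}"


def reduce_lines(sorted_lines):
    """Reducer: input arrives sorted by key, so consecutive lines share a key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for sensor, group in groupby(pairs, key=lambda pair: pair[0]):
        values = [float(value) for _, value in group]
        yield f"{sensor}\t{sum(values) / len(values)}"


# Local simulation: sorted() stands in for the framework's shuffle/sort.
# In a real streaming job each function would be fed from standard input
# and its results printed to standard output.
raw = ["a,1.0", "b,4.0", "a,3.0"]
mapped = sorted(map_lines(raw))
reduced = list(reduce_lines(mapped))
print(reduced)
```

The same line-in, line-out contract is what lets R, or any other language with good statistics libraries, plug into the job.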

HTH

姐就是有狂的资本
Reply #5 · 2019-01-22 03:47

Scenarios where Hadoop MapReduce is preferred to Hive or Pig

  1. When you need definite control over the driver program
  2. Whenever the job requires implementing a custom Partitioner
  3. If a predefined library of Java Mappers or Reducers already exists for the job
  4. If you require a good amount of testability when combining lots of large data sets
  5. If the application has legacy code requirements that dictate physical structure
  6. If the job requires optimization at a particular stage of processing, making the best use of tricks like in-mapper combining
  7. If the job involves some tricky usage of the distributed cache (replicated join), cross products, groupings, or joins
Comparison between MapReduce, Pig and Hive

Pros of Pig/Hive:

  1. Hadoop MapReduce requires more development effort than Pig and Hive.
  2. The trade-off is speed: Pig and Hive jobs are slower than a fully tuned Hadoop MapReduce program.
  3. When using Pig and Hive to execute jobs, Hadoop developers need not worry about version mismatches.
  4. There is very little room for the developer to introduce Java-level bugs when coding in Pig or Hive.

Have a look at this post for a Pig vs Hive comparison.

SAY GOODBYE
Reply #6 · 2019-01-22 03:47

Here is a great comparison. It covers all the use-case scenarios.

【Aperson】
Reply #7 · 2019-01-22 03:48

MapReduce:

Strengths:
     works on both structured and unstructured data.
     good for writing complex business logic.

Weaknesses:
     long development time.
     hard to achieve join functionality.

Hive:

Strengths:
     less development time.
     suitable for ad hoc analysis.
     easy for joins.

Weaknesses:
     not easy for complex business logic.
     deals only with structured data.

Pig:

Strengths:
     works on structured and unstructured data.
     joins are easily written.

Weaknesses:
     new language to learn.
     scripts are converted into MapReduce jobs.