I've basic understanding on what Pig, Hive abstractions are. But I don't have a clear idea on the scenarios that require Hive, Pig or native map reduce.
I went through few articles which basically points out that Hive is for structured processing and Pig is for unstructured processing. When do we need native map reduce? Can you point out few scenarios that can't be solved using Pig or Hive but in native map reduce?
All the things which we can do using PIG and HIVE can be achieved using MR (sometimes it will be time consuming though). PIG and HIVE uses MR/SPARK/TEZ underneath. So all the things which MR can do may or may not be possible in Hive and PIG.
Hive
Pros:
Sql like Data-base guys love that. Good support for structured data. Currently support database schema and views like structure Support concurrent multi users, multi session scenarios. Bigger community support. Hive , Hiver server , Hiver Server2, Impala ,Centry already
Cons: Performance degrades as data grows bigger not much to do, memory over flow issues. cant do much with it. Hierarchical data is a challenge. Un-structured data requires udf like component Combination of multiple techniques could be a nightmare dynamic portions with UTDF in case of big data etc
Pig: Pros: Great script based data flow language.
Cons:
Un-structured data requires udf like component Not a big community support
MapReudce: Pros: Dont agree with "hard to achieve join functionality", if you understand what kind of join you want to implement you can implement with few lines of code. Most of the times MR yields better performance. MR support for hierarchical data is great especially implement tree like structures. Better control at partitioning / indexing the data. Job chaining.
Cons: Need to know api very well to get a better performance etc Code / debug / maintain
Short answer - We need MapReduce when we need very deep level and fine grained control on the way we want to process our data. Sometimes, it is not very convenient to express what we need exactly in terms of Pig and Hive queries.
It should not be totally impossible to do, what you can using MapReduce, through Pig or Hive. With the level of flexibility provided by Pig and Hive you can somehow manage to achieve your goal, but it might be not that smooth. You could write UDFs or do something and achieve that.
There is no clear distinction as such among the usage of these tools. It totally depends on your particular use-case. Based on your data and the kind of processing you need to decide which tool fits into your requirements better.
Edit :
Sometime ago I had a use case wherein I had to collect seismic data and run some analytics on it. The format of the files holding this data was somewhat weird. Some part of the data was EBCDIC encoded, while rest of the data was in binary format. It was basically a flat binary file with no delimiters like\n or something. I had a tough time finding some way to process these files using Pig or Hive. As a result I had to settle down with MR. Initially it took time, but gradually it became smoother as MR is really swift once you have the basic template ready with you.
So, like I said earlier it basically depends on your use case. For example, iterating over each record of your dataset is really easy in Pig(just a foreach), but what if you need foreach n?? So, when you need "that" level of control over the way you need to process your data, MR is more suitable.
Another situation might be when you data is hierarchical rather than row-based or if your data is highly unstructured.
Metapatterns problem involving job chaining and job merging are easier to solve using MR directly rather than using Pig/Hive.
And sometimes it is very very convenient to accomplish a particular task using some xyz tool as compared to do it using Pig/hive. IMHO, MR turns out to be better in such situations as well. For example if you need to do some statistical analyses on your BigData, R used with Hadoop streaming is probably the best option to go with.
HTH
Scenarios where Hadoop Map Reduce is preferred to Hive or PIG
When you need definite driver program control
Whenever the job requires implementing a custom Partitioner
If there already exists pre-defined library of Java Mappers or Reducers for a job
Pros of Pig/Hive :
Have a look at this post for Pig Vs Hive comparison.
Here is the great comparison. It specifies all the use case scenarios.
Mapreduce:
Hive :
Pig