When to use Hadoop, HBase, Hive and Pig?

2019-01-15 23:21发布

What are the benefits of using either Hadoop or HBase or Hive ?

From my understanding, HBase avoids using map-reduce and has a column oriented storage on top of HDFS. Hive is a sql-like interface for Hadoop and HBase.

I would also like to know how Hive compares with Pig.

13条回答
beautiful°
2楼-- · 2019-01-16 00:05

For a Comparison Between Hadoop Vs Cassandra/HBase read this post.

Basically HBase enables really fast read and writes with scalability. How fast and scalable? Facebook uses it to manage its user statuses, photos, chat messages etc. HBase is so fast sometimes stacks have been developed by Facebook to use HBase as the data store for Hive itself.

Where As Hive is more like a Data Warehousing solution. You can use a syntax similar to SQL to query Hive contents which results in a Map Reduce job. Not ideal for fast, transactional systems.

查看更多
混吃等死
3楼-- · 2019-01-16 00:08

MapReduce is just a computing framework. HBase has nothing to do with it. That said, you can efficiently put or fetch data to/from HBase by writing MapReduce jobs. Alternatively you can write sequential programs using other HBase APIs, such as Java, to put or fetch the data. But we use Hadoop, HBase etc to deal with gigantic amounts of data, so that doesn't make much sense. Using normal sequential programs would be highly inefficient when your data is too huge.

Coming back to the first part of your question, Hadoop is basically 2 things: a Distributed FileSystem (HDFS) + a Computation or Processing framework (MapReduce). Like all other FS, HDFS also provides us storage, but in a fault tolerant manner with high throughput and lower risk of data loss (because of the replication). But, being a FS, HDFS lacks random read and write access. This is where HBase comes into picture. It's a distributed, scalable, big data store, modelled after Google's BigTable. It stores data as key/value pairs.

Coming to Hive. It provides us data warehousing facilities on top of an existing Hadoop cluster. Along with that it provides an SQL like interface which makes your work easier, in case you are coming from an SQL background. You can create tables in Hive and store data there. Along with that you can even map your existing HBase tables to Hive and operate on them.

While Pig is basically a dataflow language that allows us to process enormous amounts of data very easily and quickly. Pig basically has 2 parts: the Pig Interpreter and the language, PigLatin. You write Pig script in PigLatin and using Pig interpreter process them. Pig makes our life a lot easier, otherwise writing MapReduce is always not easy. In fact in some cases it can really become a pain.

I had written an article on a short comparison of different tools of the Hadoop ecosystem some time ago. It's not an in depth comparison, but a short intro to each of these tools which can help you to get started. (Just to add on to my answer. No self promotion intended)

Both Hive and Pig queries get converted into MapReduce jobs under the hood.

HTH

查看更多
霸刀☆藐视天下
4楼-- · 2019-01-16 00:09

First of all we should get clear that Hadoop was created as a faster alternative to RDBMS. To process large amount of data at a very fast rate which earlier took a lot of time in RDBMS.

Now one should know the two terms :

  1. Structured Data : This is the data that we used in traditional RDBMS and is divided into well defined structures.

  2. Unstructured Data : This is important to understand, about 80% of the world data is unstructured or semi structured. These are the data which are on its raw form and cannot be processed using RDMS. Example : facebook, twitter data. (http://www.dummies.com/how-to/content/unstructured-data-in-a-big-data-environment.html).

So, large amount of data was being generated in the last few years and the data was mostly unstructured, that gave birth to HADOOP. It was mainly used for very large amount of data that takes unfeasible amount of time using RDBMS. It had many drawbacks, that it could not be used for comparatively small data in real time but they have managed to remove its drawbacks in the newer version.

Before going further I would like to tell that a new Big Data tool is created when they see a fault on the previous tools. So, whichever tool you will see that is created has been done to overcome the problem of the previous tools.

Hadoop can be simply said as two things : Mapreduce and HDFS. Mapreduce is where the processing takes place and HDFS is the DataBase where data is stored. This structure followed WORM principal i.e. write once read multiple times. So, once we have stored data in HDFS, we cannot make changes. This led to the creation of HBASE, a NOSQL product where we can make changes in the data also after writing it once.

But with time we saw that Hadoop had many faults and for that we created different environment over the Hadoop structure. PIG and HIVE are two popular examples.

HIVE was created for people with SQL background. The queries written is similar to SQL named as HIVEQL. HIVE was developed to process completely structured data. It is not used for ustructured data.

PIG on the other hand has its own query language i.e. PIG LATIN. It can be used for both structured as well as unstructured data.

Moving to the difference as when to use HIVE and when to use PIG, I don't think anyone other than the architect of PIG could say. Follow the link : https://developer.yahoo.com/blogs/hadoop/comparing-pig-latin-sql-constructing-data-processing-pipelines-444.html

查看更多
你好瞎i
5楼-- · 2019-01-16 00:14

Hadoop is a a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

There are four main modules in Hadoop.

  1. Hadoop Common: The common utilities that support the other Hadoop modules.

  2. Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

  3. Hadoop YARN: A framework for job scheduling and cluster resource management.

  4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Before going further, Let's note that we have three different types of data.

  • Structured: Structured data has strong schema and schema will be checked during write & read operation. e.g. Data in RDBMS systems like Oracle, MySQL Server etc.

  • Unstructured: Data does not have any structure and it can be any form - Web server logs, E-Mail, Images etc.

  • Semi-structured: Data is not strictly structured but have some structure. e.g. XML files.

Depending on type of data to be processed, we have to choose right technology.

Some more projects, which are part of Hadoop:

  • HBase™: A scalable, distributed database that supports structured data storage for large tables.

  • Hive™: A data warehouse infrastructure that provides data summarization and ad-hoc querying.

  • Pig™: A high-level data-flow language and execution framework for parallel computation.

Hive Vs PIG comparison can be found at this article and my other post at this SE question.

HBASE won't replace Map Reduce. HBase is scalable distributed database & Map Reduce is programming model for distributed processing of data. Map Reduce may act on data in HBASE in processing.

You can use HIVE/HBASE for structured/semi-structured data and process it with Hadoop Map Reduce

You can use SQOOP to import structured data from traditional RDBMS database Oracle, SQL Server etc and process it with Hadoop Map Reduce

You can use FLUME for processing Un-structured data and process with Hadoop Map Reduce

Have a look at: Hadoop Use Cases.

Hive should be used for analytical querying of data collected over a period of time. e.g Calculate trends, summarize website logs but it can't be used for real time queries.

HBase fits for real-time querying of Big Data. Facebook use it for messaging and real-time analytics.

PIG can be used to construct dataflows, run a scheduled jobs, crunch big volumes of data, aggregate/summarize it and store into relation database systems. Good for ad-hoc analysis.

Hive can be used for ad-hoc data analysis but it can't support all un-structured data formats unlike PIG.

查看更多
Ridiculous、
6楼-- · 2019-01-16 00:14

Consider that you work with RDBMS and have to select what to use - full table scans, or index access - but only one of them.
If you select full table scan - use hive. If index access - HBase.

查看更多
疯言疯语
7楼-- · 2019-01-16 00:21

I implemented a Hive Data platform recently in my firm and can speak to it in first person since I was a one man team.

Objective

  1. To have the daily web log files collected from 350+ servers daily queryable thru some SQL like language
  2. To replace daily aggregation data generated thru MySQL with Hive
  3. Build Custom reports thru queries in Hive

Architecture Options

I benchmarked the following options:

  1. Hive+HDFS
  2. Hive+HBase - queries were too slow so I dumped this option

Design

  1. Daily log Files were transported to HDFS
  2. MR jobs parsed these log files and output files in HDFS
  3. Create Hive tables with partitions and locations pointing to HDFS locations
  4. Create Hive query scripts (call it HQL if u like as diff from SQL) that in turn ran MR jobs in the background and generated aggregation data
  5. Put all these steps into an Oozie workflow - scheduled with Daily Oozie Coordinator

Summary

HBase is like a Map. If you know the key, you can instantly get the value. But if you want to know how many integer keys in Hbase are between 1000000 and 2000000 that is not suitable for Hbase alone.

If you have data to be aggregated, rolled up, analyzed across rows then consider Hive.

Hopefully this helps.

Hive actually rocks really well...I know, I have lived it for 12 months now... So does HBase...

查看更多
登录 后发表回答