What are the benefits of using Hadoop, HBase, or Hive?
From my understanding, HBase avoids using MapReduce and has column-oriented storage on top of HDFS. Hive is a SQL-like interface for Hadoop and HBase.
I would also like to know how Hive compares with Pig.
1. We use Hadoop to store large data (structured, unstructured, and semi-structured) in file formats such as txt and csv.
2. If we want columnar updates on our data, we use HBase (see the sketch after this list).
3. With Hive, we store big data that is in a structured format, and in addition we provide analysis on that data.
4. Pig is a tool that uses the Pig Latin language to analyze data in any format (structured, semi-structured, and unstructured).
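To illustrate point 2, here is a minimal sketch of a single-column update through the HBase Java client. The table name, column family, row key, and a reachable HBase cluster are all assumptions for illustration.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseColumnUpdate {
    public static void main(String[] args) throws Exception {
        // Connects to the cluster described by hbase-site.xml on the classpath.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {  // hypothetical table
            // Update a single column of one row: HBase rewrites only the
            // affected cell, not the whole row or an entire file.
            Put put = new Put(Bytes.toBytes("user-1001"));               // row key (assumed)
            put.addColumn(Bytes.toBytes("info"),                         // column family (assumed)
                          Bytes.toBytes("email"),
                          Bytes.toBytes("new.address@example.com"));
            table.put(put);
        }
    }
}
```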
Pig: better for handling files and cleansing data, for example removing null values, string handling, and dropping unnecessary values.
Hive: for querying the cleansed data.
Cleansing data in Pig is very easy. A suitable approach is to cleanse the data through Pig, process it through Hive, and later upload it to HDFS.
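As a rough illustration of the Pig half of that workflow, the sketch below drives a small cleansing script from Java through Pig's `PigServer` API. The input path, field layout, and output path are assumptions.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class CleanseWithPig {
    public static void main(String[] args) throws Exception {
        // Runs Pig Latin statements as MapReduce jobs on the cluster.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // Load a hypothetical CSV file and declare a schema for it.
        pig.registerQuery("raw = LOAD '/data/raw/users.csv' USING PigStorage(',') "
                + "AS (id:int, name:chararray, email:chararray);");
        // Cleansing step: drop records with null or unusable values.
        pig.registerQuery("clean = FILTER raw BY id IS NOT NULL AND email IS NOT NULL;");
        // Store the cleansed records where Hive can pick them up.
        pig.store("clean", "/data/clean/users");
    }
}
```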
I worked on a Lambda architecture processing real-time and batch loads. Real-time processing is needed where fast decisions must be taken, for example when a fire alarm is sent by a sensor, or for fraud detection in banking transactions. Batch processing is needed to summarize data that can be fed into BI systems.
We used Hadoop ecosystem technologies for the above applications.
Real-Time Processing
Apache Storm: stream data processing, rule application
HBase: datastore for serving the real-time dashboard
Batch Processing
Hadoop: crunching huge chunks of data; building a 360-degree overview or adding context to events. Interfaces or frameworks like Pig, MR, Spark, Hive, and Shark help with the computation. This layer needs a scheduler, for which Oozie is a good option.
Event Handling Layer
Apache Kafka was the first layer, consuming high-velocity events from the sensors. Kafka serves both the real-time and batch analytics data flows through LinkedIn connectors.
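For context, here is a minimal sketch of the kind of producer that would push sensor events into that Kafka layer. The broker address, topic name, and event payload are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // broker address (assumed)
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        // Both the real-time (Storm) and batch (Hadoop) paths read from this topic.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("sensor-events",          // topic (assumed)
                    "sensor-42", "{\"temperature\": 91.5, \"alarm\": true}"));
        }
    }
}
```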
Let me try to answer in a few words.
Hadoop is an ecosystem which comprises all the other tools. So you can't compare Hadoop itself, but you can compare MapReduce.
Here are my few cents:
Understanding in depth
Hadoop
Hadoop is an open source project of the Apache Foundation. It is a framework written in Java, originally developed by Doug Cutting in 2005. It was created to support distribution for Nutch, the text search engine. Hadoop uses Google's MapReduce and Google File System technologies as its foundation.
Features of Hadoop
Hadoop is built for high throughput rather than low latency. It is a batch operation handling massive quantities of data; therefore the response time is not immediate. It is not a replacement for an RDBMS.
Versions of Hadoop
There are two versions of Hadoop available:
Hadoop 1.0
It has two main parts:
1. Data Storage Framework
It is a general-purpose file system called the Hadoop Distributed File System (HDFS). HDFS is schema-less. It simply stores data files, and these data files can be in just about any format. The idea is to store files as close to their original form as possible. This in turn provides the business units and the organization the much-needed flexibility and agility without being overly worried about what it can implement.
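As a small illustration of what schema-less storage looks like in practice, here is a sketch that writes a raw file to HDFS through Hadoop's Java `FileSystem` API. The path and file contents are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StoreRawFile {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/raw/events.csv"))) {
            // HDFS imposes no schema: the bytes are stored as-is, in their
            // original CSV form, and interpreted only at processing time.
            out.writeBytes("sensor-42,2015-06-01T12:00:00,91.5\n");
        }
    }
}
```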
2. Data Processing Framework
This is a simple functional programming model initially popularized by Google as MapReduce. It essentially uses two functions, MAP and REDUCE, to process data. The "Mappers" take in a set of key-value pairs and generate intermediate data (which is another list of key-value pairs). The "Reducers" then act on this input to produce the output data.
The two functions seemingly work in isolation from one another, thus enabling the processing to be highly distributed in a highly parallel, fault-tolerant and scalable way.
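To make the mapper/reducer split concrete, here is a minimal word-count sketch against Hadoop's Java MapReduce API. The class names and the input/output paths are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: emits an intermediate (word, 1) pair for every word seen.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Reducer: receives (word, [1, 1, ...]) and sums the counts.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/raw"));      // input dir (assumed)
        FileOutputFormat.setOutputPath(job, new Path("/data/counts")); // output dir (assumed)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```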
Limitations of Hadoop 1.0
The first limitation was the requirement of MapReduce programming expertise.
It supported only batch processing, which, although suitable for tasks such as log analysis and large-scale data mining projects, is pretty much unsuitable for other kinds of projects.
One major limitation was that Hadoop 1.0 was tightly computationally coupled with MapReduce, which meant that the established data management vendors were left with two options:
Either rewrite their functionality in MapReduce so that it could be executed in Hadoop, or
Extract the data from HDFS and process it outside of Hadoop.
Neither option was viable, as it led to process inefficiencies caused by data being moved in and out of the Hadoop cluster.
Hadoop 2.0
In Hadoop 2.0, HDFS continues to be the data storage framework.
However, a new and separate resource management framework called Yet Another Resource Negotiator (YARN) has been added. Any application capable of dividing itself into parallel tasks is supported by YARN.
YARN coordinates the allocation of subtasks of the submitted application, thereby further enhancing the flexibility, scalability and efficiency of applications.
It works by having an ApplicationMaster in place of the Job Tracker, running applications on resources governed by a new NodeManager.
The ApplicationMaster is able to run any application, and not just MapReduce. This means it supports not only batch processing but also real-time processing. MapReduce is no longer the only data processing option.
Advantages of Hadoop
It stores data in its native form. There is no structure imposed while keying in or storing data. HDFS is schema-less. It is only later, when the data needs to be processed, that structure is imposed on the raw data.
It is scalable. Hadoop can store and distribute very large datasets across hundreds of inexpensive servers that operate in parallel.
It is resilient to failure. Hadoop is fault-tolerant. It practices replication of data diligently, which means that whenever data is sent to any node, the same data also gets replicated to other nodes in the cluster, thereby ensuring that in the event of a node failure there will always be another copy of the data available for use.
It is flexible. One of the key advantages of Hadoop is that it can work with any kind of data: structured, unstructured or semi-structured. Also, processing is extremely fast in Hadoop owing to the "move code to data" paradigm.
Hadoop Ecosystem
Following are the components of the Hadoop ecosystem:
HDFS: Hadoop Distributed File System. It simply stores data files as close to the original form as possible.
HBase: It is Hadoop's database and compares well with an RDBMS. It supports structured data storage for large tables.
Hive: It enables analysis of large datasets using a language very similar to standard ANSI SQL, which implies that anyone familiar with SQL should be able to access data on a Hadoop cluster.
Pig: It is an easy-to-understand data flow language. It helps with the analysis of large datasets, which is quite the order of the day with Hadoop. Pig scripts are automatically converted to MapReduce jobs by the Pig interpreter.
ZooKeeper: It is a coordination service for distributed applications.
Oozie: It is a workflow scheduler system to manage Apache Hadoop jobs.
Mahout: It is a scalable machine learning and data mining library.
Chukwa: It is a data collection system for managing large distributed systems.
Sqoop: It is used to transfer bulk data between Hadoop and structured data stores such as relational databases.
Ambari: It is a web-based tool for provisioning, managing and monitoring Hadoop clusters.
Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy.
Hive is not
A relational database.
A design for Online Transaction Processing (OLTP).
A language for real-time queries and row-level updates.
Features of Hive
It stores the schema in a database and the processed data into HDFS.
It is designed for OLAP.
It provides an SQL-type language for querying, called HiveQL or HQL.
It is familiar, fast, scalable and extensible.
Hive Architecture
The following components are contained in the Hive architecture:
User Interface: Hive is a data warehouse infrastructure that can create interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (on Windows Server).
MetaStore: Hive chooses the respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine: HiveQL is similar to SQL for querying the schema info in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program. Instead of writing MapReduce in Java, we can write a query for MapReduce and process it.
Execution Engine: The conjunction part of the HiveQL process engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates the same results as MapReduce. It uses the flavor of MapReduce.
HDFS or HBase: Hadoop Distributed File System or HBase are the data storage techniques used to store data into the file system.
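To tie those components together, here is a minimal sketch of submitting a HiveQL query from Java through the HiveServer2 JDBC driver, exercising the user interface, the HiveQL process engine, and the execution engine described above. The connection URL, credentials, and table name are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (needed on older driver versions).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 endpoint; host, port and database are assumed.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // The HiveQL process engine compiles this into MapReduce work
             // that the execution engine runs over data stored in HDFS.
             ResultSet rs = stmt.executeQuery(
                     "SELECT name, COUNT(*) FROM users GROUP BY name")) { // hypothetical table
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```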