My data does not need to be loaded in real time, so I don't have to use HBase, but I was wondering whether there are any performance benefits to using HBase in MR jobs. Shouldn't the joins be faster because the data is indexed?
Does anybody have any benchmarks?
Generally speaking, Hive on HDFS will be significantly faster than HBase for batch jobs like this. HBase sits on top of HDFS, so it adds another layer. HBase would be faster if you were looking up individual records, but you wouldn't use an MR job for that.
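To make the distinction concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the HBase or Hive API; the names `rows`, `other`, `point_lookup`, and `full_scan_join` are made up for illustration. It assumes a table kept sorted by key (as HBase keeps rows sorted by row key) and shows that the sort order helps a single-record lookup, while a batch join still has to read every row of both inputs.

```python
import bisect

# Toy "table": (key, value) rows kept sorted by key.
rows = [(k, f"value-{k}") for k in range(0, 1000, 2)]
keys = [k for k, _ in rows]

def point_lookup(key):
    """Single-record lookup via binary search: touches O(log n) keys."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return rows[i][1]
    return None

def full_scan_join(left, right):
    """Batch hash join: reads every row of both inputs exactly once.

    The sorted order of `left` is never used, so an index saves nothing here.
    """
    lookup = dict(right)
    return [(k, v, lookup[k]) for k, v in left if k in lookup]

# A second toy table to join against (hypothetical data).
other = [(k, f"other-{k}") for k in range(0, 1000, 3)]
joined = full_scan_join(rows, other)
```

In this model the lookup is where the sorted layout pays off; the join cost is dominated by reading all the data, which is roughly the situation a full-table MR job is in.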
Performance of HBase vs. Hive:
Based on the results for HBase, Hive, and Hive on HBase, it appears that the performance of the approaches is comparable.
Hive on HBase Performance
Respectfully :) if your data is not real-time and you are planning to run MapReduce jobs anyway, then just go with Hive over HDFS. Weblogs can be processed by a Hadoop MapReduce program and stored in HDFS. Meanwhile, Hive supports fast reading of the data in its HDFS location, basic SQL, joins, and batch data loads into the Hive database.
Hive also provides:
Bulk processing / real time (if possible)
A SQL-like interface
Built-in optimized MapReduce
Partitioning of large data, which fits well with HDFS
This lets you avoid the extra HBase layer; if you add HBase here, its features would be redundant for you :)
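The partitioning point can be illustrated with a small sketch in plain Python. This is only a toy model of the idea, not Hive itself; the weblog records and the names `partition_by_date` and `query` are hypothetical. It assumes the data is split by date, so a query for one date reads only that partition instead of the whole data set.

```python
from collections import defaultdict

# Hypothetical weblog records: (date, url).
logs = [
    ("2013-01-01", "/home"),
    ("2013-01-01", "/about"),
    ("2013-01-02", "/home"),
    ("2013-01-03", "/contact"),
]

def partition_by_date(records):
    """Group records into one bucket per date, like one directory per partition."""
    partitions = defaultdict(list)
    for date, url in records:
        partitions[date].append(url)
    return partitions

def query(partitions, date):
    """Read only the partition for `date`; return (rows, number of rows read)."""
    rows = partitions.get(date, [])
    return rows, len(rows)

partitions = partition_by_date(logs)
hits, rows_read = query(partitions, "2013-01-01")
```

Here the query touches 2 of the 4 records; on a large data set the same pruning is what keeps batch jobs from scanning data they do not need.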