How to improve performance of loading data from NO

I'm new to Hive Querying, I'm looking for best practices to retrieve data from Hive table. we have enabled TeZ has execution engine and enabled vectorization.

We want to make reporting from Hive table, I read from TEZ document that it can be used for real time reporting. Scenario is from my WEB Application, I would like to show result from Hive Query Select * from Hive table on UI, but for any query, in the hive command prompt takes minimum 20-60 secs even though hive table has 60 GB data ,.

1) Can any one tell me how to show real time reporting by querying Hive table and show results immediately on UI within 10-30 secs

2) Another problem we have identified is, Initially we have Un-Partitioned table pointing to a Blob/File in HDFS,it is of size 60 GB with 200 columns, when we dump the data from Un-Partitioned table to ORC table(ORC table is partitioned), it takes 3 + hrs, Is there a way to improve performance in dumping data into ORC table.

3) When we do querying on Non Partition table with bucketing, inserting to hive table and querying taking less time than select query on ORC table, but has the number of records in hive table increase ORC table's SELECT query is better than table with buckets. Is there a way to improve performance for small data sets also. Since it is initial phase, every month we load 50 GB data into Hive table. but it can increase, we looking improve performance of loading data into Orc partitioned table.

4) TEZ supports interactive, less latency and drill down support for reports. How to enable my drill down reports to get data from Hive ( which should be interactive) within in Human response time i.e 5-40 sec.

we are testing with 4 Nodes each Node is having 4 cpu cores and 7 GB RAM and 3 disk attached to each VM.

Thanks, Mahender

回答1:

In order to improve the speed of inserting data to ORC table, you can try playing around with following parameters:

hive.exec.orc.memory.pool 
hive.exec.orc.default.stripe.size
hive.exec.orc.default.block.size 
hive.exec.orc.default.buffer.size
dfs.blocksize

Also, you might see, whether compression might also help you. For example:

SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.intermediate = true;

Hope it helps!

回答2:

First of all. HIVE is not meant for real time data processing. No matter how small the data may be the query will take a while to return data.

Real power of hive lies in batch processing huge amount of data.