I have recently started looking into querying large sets of CSV data lying on HDFS using Hive and Impala. As I was expecting, I get better response time with Impala compared to Hive for the queries I have used so far.
I am wondering if there are some types of queries/use cases that still need Hive and where Impala is not a good fit.
How does Impala provide faster query response compared to Hive for the same data on HDFS?
You should see Impala as "SQL on HDFS", while Hive is more "SQL on Hadoop".
In other words, Impala doesn't even use Hadoop at all. It simply has daemons running on all your nodes which cache some of the data that is in HDFS, so that these daemons can return data quickly without having to go through a whole Map/Reduce job.
The reason for this is that there is a certain overhead involved in running a Map/Reduce job, so by short-circuiting Map/Reduce altogether you can get some pretty big gain in runtime.
That being said, Impala does not replace Hive, it is good for very different use cases. Impala doesn't provide fault-tolerance compared to Hive, so if there is a problem during your query then it's gone. Definitely for ETL type of jobs where failure of one job would be costly I would recommend Hive, but Impala can be awesome for small ad-hoc queries, for example for data scientists or business analysts who just want to take a look and analyze some data without building robust jobs. Also from my personal experience, Impala is still not very mature, and I've seen some crashes sometimes when the amount of data is larger than available memory.
IMHO, SQL on HDFS and SQL on Hadoop are the same. After all Hadoop is HDFS( and also MapReduce). So when we say SQL on HDFS, it is understood that it is SQL on Hadoop(could be with or without MapReduce).
Coming back to the actual question, Impala provides faster response as it uses MPP
(massively parallel processing) unlike Hive which uses MapReduce under the hood, which involves some initial overheads (as Charles sir has specified). Massively parallel processing is a type of computing that uses many separate CPUs running in parallel to execute a single program where each CPU has it's own dedicated memory. The very fact that Impala, being MPP based, doesn't involve the overheads of a MapReduce jobs viz. job setup and creation, slot assignment, split creation, map generation etc., makes it blazingly fast.
But that doesn't mean that Impala is the solution to all your problems. Being highly memory intensive (MPP), it is not a good fit for tasks that require heavy data operations like joins etc., as you just can't fit everything into the memory. This is where Hive is a better fit.
So, if you need real time, ad-hoc queries over a subset of your data go for Impala. And if you have batch processing kinda needs over your Big Data go for Hive.
HTH
There are some key features in impala that makes its fast.
It does not use map/reduce which are very expensive to fork in
separate jvms. It runs separate Impala Daemon which splits the query
and runs them in parallel and merge result set at the end.
It does most of its operation in-memory.
It uses hdfs for its storage which is fast for large files. It
caches as much as possible from queries to results to data.
It supports new file format like parquet, which is columnar file
format. So if you use this format it will be faster for queries where
you are accessing only few columns
most of the time.