Should I be creating and dropping intermediate tables in hive?
I can write something like (much simplified):
drop table if exists tmp1;
create table tmp1 as
select a, b, c
from input1
where a > 1 and b < 3;
drop table if exists tmp2;
create table tmp2 as
select x, y, z
from input2
where x < 6;
drop table if exists output;
create table output as
select x, a, count(*) as count
from tmp1 join tmp2 on tmp1.c = tmp2.z
group by tmp1.b;
drop table tmp1;
drop table tmp2;
or I can roll everything into one statement:
drop table if exists output;
create table output as
select x, a, count(*) as count
from (select a, b, c
from input1
where a > 1 and b < 3) t1
join (select x, y, z
from input2
where x < 6) t2
on t1.c = t2.z
group by t1.b;
Obviously, if I reuse the intermediate tables more than once, it makes perfect sense to create them. However, when they are used just once, I have a choice.
I tried both and the second one is 6% faster as measured by the wall time, but 4% slower as measured by the MapReduce Total cumulative CPU time
log output.
This difference is probably within the random margin of error (caused by other processes &c).
However, is it possible that combining queries could result in a dramatic speedup?
Another question is: are intermediate tables, which are used just once, a normal occurrence in hive code, or should they be avoided when possible?
I think combining queries is a good thing. It allows Hive query optimizer to, well, optimize query.
Consider this silly query:
When you run it Hive'll launch only one MapReduce job.
Using intermediate tables
will obviously run two MapReduce jobs.
So to answer your question: yes, combining queries could result in speedup.
As you have discovered there may not be a major difference in timing. Most likely you would want to maintain the temporary tables for (a) "savepoint" / intermediate rollback or (b) troubleshooting purposes. Otherwise it may not be worth the management effort to remember (or automate) the cleanup/dropping of the intermediate tables.
There is one significant difference.
Running the one big query will allow the optimizer more freedom in optimizations.
One of the most important optimizations in such cases are paralellims as set in
hive.exec.parallel
. when set to true hive will exexcute independant stages in parallel.In your case, in the 2nd query imagine t1,t2 do more complex work like
group by
. in the 2nd query t1,t2 will execute simultaniusly while in the first script the will be serial.I like to create multiple views, and then only create a table at the end. This allows the Hive optimizer to reduce the number of map-reduce steps, and execute in parallel as dimamah and Nigel have pointed out, but helps maintain readability for very complicated pipelines.
For your example, you could replace it with