Hive SQL coding style: intermediate tables?

2019-05-20 19:00发布

Should I be creating and dropping intermediate tables in hive?

I can write something like (much simplified):

drop table if exists tmp1;
create table tmp1 as
select a, b, c
from input1
where a > 1 and b < 3;

drop table if exists tmp2;
create table tmp2 as
select x, y, z
from input2
where x < 6;

drop table if exists output;
create table output as
select x, a, count(*) as count
from tmp1 join tmp2 on tmp1.c = tmp2.z
group by tmp1.b;
drop table tmp1;
drop table tmp2;

or I can roll everything into one statement:

drop table if exists output;
create table output as
select x, a, count(*) as count
from (select a, b, c
    from input1
    where a > 1 and b < 3) t1
join (select x, y, z
    from input2
    where x < 6) t2
on t1.c = t2.z
group by t1.b;

Obviously, if I reuse the intermediate tables more than once, it makes perfect sense to create them. However, when they are used just once, I have a choice.

I tried both and the second one is 6% faster as measured by the wall time, but 4% slower as measured by the MapReduce Total cumulative CPU time log output. This difference is probably within the random margin of error (caused by other processes &c). However, is it possible that combining queries could result in a dramatic speedup?

Another question is: are intermediate tables, which are used just once, a normal occurrence in hive code, or should they be avoided when possible?

标签: sql hadoop hive
4条回答
我只想做你的唯一
2楼-- · 2019-05-20 19:10

I think combining queries is a good thing. It allows Hive query optimizer to, well, optimize query.

Consider this silly query:

SELECT COUNT(*) FROM (SELECT * FROM clicks WHERE dt = '2014-01-07') t;

When you run it Hive'll launch only one MapReduce job.

Using intermediate tables

CREATE TABLE tmp AS SELECT * FROM clicks WHERE dt = '2014-01-07';
SELECT COUNT(*) FROM tmp;

will obviously run two MapReduce jobs.

So to answer your question: yes, combining queries could result in speedup.

查看更多
看我几分像从前
3楼-- · 2019-05-20 19:14

As you have discovered there may not be a major difference in timing. Most likely you would want to maintain the temporary tables for (a) "savepoint" / intermediate rollback or (b) troubleshooting purposes. Otherwise it may not be worth the management effort to remember (or automate) the cleanup/dropping of the intermediate tables.

查看更多
小情绪 Triste *
4楼-- · 2019-05-20 19:20

There is one significant difference.
Running the one big query will allow the optimizer more freedom in optimizations.
One of the most important optimizations in such cases are paralellims as set in hive.exec.parallel. when set to true hive will exexcute independant stages in parallel.
In your case, in the 2nd query imagine t1,t2 do more complex work likegroup by. in the 2nd query t1,t2 will execute simultaniusly while in the first script the will be serial.

查看更多
欢心
5楼-- · 2019-05-20 19:36

I like to create multiple views, and then only create a table at the end. This allows the Hive optimizer to reduce the number of map-reduce steps, and execute in parallel as dimamah and Nigel have pointed out, but helps maintain readability for very complicated pipelines.

For your example, you could replace it with

CREATE VIEW IF NOT EXISTS tmp1_view
AS
SELECT a, b, c FROM inputs
where a > 1 and b < 3;


create view if not exists tmp2_view as
select x, y, z_
from input2
where x < 6;

drop table if exists output;
create table output as
select x, a, count(*) as count
from tmp1_view join tmp2_view on tmp1_view.c = tmp2_view.z
group by tmp1_view.b;
查看更多
登录 后发表回答