I am executing a query in spark sql like below. The data of the tables is stored in 2 different nodes in hive tables.
But because the query is a bit slow I try to find some options in spark so the query can execute faster. So I found that we can configure sparksql.sql.codegen
and spark.sql.inMemoryColumnarStorage.compressed
to true instead of the default false.
But I'm not having any improvement, the query with this two options at true is taking 4,1 minutes to execute. With this options at false is taking also 4,1 minutes.
Do you understand why this options arent working?
query = hiveContext.sql("""select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem
where
l_shipdate <= '1998-09-16'
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus""");
query.collect();
spark.sql.codegen.wholeStage
is enabled by default for spark 2.0. and it will do all the internal optimization possible from the spark catalist side .spark.sql.codegen
(which is feature in Spark 1.3+) is by defaultfalse
. Even if you make as true, you can cross-check withDF.explain / debug
However, pls. Re-visit the approach which was explained in spark 2+ as below.
If you are using lower version of spark i.e 1.3 or 1.4+ the same DataFrame approach is valid except we have to use with hiveContext.
please try the below pseudo code.
create a data frame with out any aggregation, group, order by like this.