Is there any way to cache a cache sql query result without using rdd.cache()? for examples:
output = sqlContext.sql("SELECT * From people")
We can use output.cache() to cache the result, but then we cannot use sql query to deal with it.
So I want to ask is there anything like sqlcontext.cacheTable() to cache the result?
The following is most like using .cache for RDDs and helpful in Zeppelin or similar SQL-heavy-environments
then you get cached reads both for subsequent usages of
interesting_query
, as well as on all queries onCACHED_TABLE
.This answer is based off of the accepted answer, but the power of using
AS
is what really made the call useful in the more constrained SQL-only environments, where you cannot.collect()
or do RDD/Dataframe-operations in any way.You should use
sqlContext.cacheTable("table_name")
in order to cache it, or alternatively useCACHE TABLE table_name
SQL query.Here's an example. I've got this file on HDFS:
Then the code in PySpark:
Now we have a table and can query it:
To persist it, we have 3 options:
1st and 2nd options are preferred as they would cache the data in optimized in-memory columnar format, while 3rd would cache it just as any other RDD in row-oriented fashion
So going back to your question, here's one possible solution: