cache tables in apache spark sql

2019-04-12 00:08发布

问题:

From the Spark official document, it says:

Spark SQL can cache tables using an in-memory columnar format by calling sqlContext.cacheTable("tableName") or dataFrame.cache(). Then Spark SQL will scan only required columns and will automatically tune compression to minimize memory usage and GC pressure. You can call sqlContext.uncacheTable("tableName") to remove the table from memory.

What does caching tables using a in-memory columnar format really mean? Put the whole table into the memory? As we know that cache is also lazy, the table is cached after the first action on the query. Does it make any difference to the cached table if choosing different actions or queries? I've googled this cache topic several times but failed to find some detailed articles. I would really appreciate it if anyone can provides some links or articles for this topic.

http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory

回答1:

Yes, caching the tables put the whole table in memory compressed if you use this setting: spark.sql.inMemoryColumnarStorage.compressed = true. Keep in mind, when doing caching on a DataFrame it is Lazy caching which means it will only cache what rows are used in the next processing event. So if you do a query on that DataFrame and only scan 100 rows, those will only be cached, not the entire table. If you do CACHE TABLE MyTableName in SQL though, it is defaulted to be eager caching and will cache the entire table. You can choose LAZY caching in SQL like so: