Cache tables in Apache Spark SQL

Posted 2019-04-11 23:24

The official Spark documentation says:

Spark SQL can cache tables using an in-memory columnar format by calling sqlContext.cacheTable("tableName") or dataFrame.cache(). Then Spark SQL will scan only required columns and will automatically tune compression to minimize memory usage and GC pressure. You can call sqlContext.uncacheTable("tableName") to remove the table from memory.
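
For concreteness, here is a minimal Scala sketch of the calls the quoted docs describe. The SparkSession setup, the people.json path, and the "people" view name are illustrative assumptions, not part of the documentation.

// A minimal sketch of the caching API quoted above; paths and names are made up.
import org.apache.spark.sql.SparkSession

object CacheTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-table-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input; any DataFrame registered as a temp view works.
    val df = spark.read.json("examples/src/main/resources/people.json")
    df.createOrReplaceTempView("people")

    // Cache by table name (stored in the in-memory columnar format);
    // calling df.cache() would have the same effect.
    spark.sqlContext.cacheTable("people")

    // Caching is lazy: the first action actually materializes the cache.
    spark.sql("SELECT name FROM people").show()

    // Remove the table from memory when done.
    spark.sqlContext.uncacheTable("people")
    spark.stop()
  }
}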

What does caching tables using an in-memory columnar format really mean? Does it put the whole table into memory? As we know, caching is also lazy: the table is cached only after the first action on the query. Does it make any difference to the cached table whether I choose different actions or queries? I've googled this caching topic several times but failed to find any detailed articles. I would really appreciate it if anyone could provide some links or articles on this topic.

http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory

1 Answer
我命由我不由天
Answered 2019-04-12 00:06

Yes, caching a table puts the whole table in memory, compressed, if you use this setting: spark.sql.inMemoryColumnarStorage.compressed = true. Keep in mind that caching a DataFrame is lazy caching, which means it will only cache the rows that are used in the next processing event. So if you run a query on that DataFrame and only scan 100 rows, only those will be cached, not the entire table. If you run CACHE TABLE MyTableName in SQL, however, it defaults to eager caching and will cache the entire table. You can choose LAZY caching in SQL like so:

CACHE LAZY TABLE MyTableName
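
To illustrate the lazy-vs-eager difference described above, here is a hedged Scala sketch; the SparkSession setup and the contents of MyTableName are made up for the example, and only the table name comes from the answer.

// A sketch of lazy DataFrame caching vs eager SQL caching; data is illustrative.
import org.apache.spark.sql.SparkSession

object LazyVsEagerCache {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-vs-eager-cache")
      .master("local[*]")
      // Default setting, shown explicitly: cached columns are compressed.
      .config("spark.sql.inMemoryColumnarStorage.compressed", "true")
      .getOrCreate()

    spark.range(0, 1000000).toDF("id").createOrReplaceTempView("MyTableName")

    // Lazy: df.cache() only marks the data for caching; nothing is stored
    // until an action runs, and only the data that action scans is cached.
    val df = spark.table("MyTableName")
    df.cache()
    df.show() // small action; materializes only what it needs to scan

    spark.sql("UNCACHE TABLE MyTableName")

    // Eager: CACHE TABLE materializes the whole table right away.
    spark.sql("CACHE TABLE MyTableName")
    spark.sql("UNCACHE TABLE MyTableName")

    // Or opt back into lazy behaviour from SQL.
    spark.sql("CACHE LAZY TABLE MyTableName")

    spark.stop()
  }
}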