Why do I have to explicitly tell Spark what to cac

In Spark, each time we do any action on an RDD, the RDD is re-computed. So if we know that the RDD is going to be reused, we should cache the RDD explicitly.

Let's say, Spark decides to lazily cache all the RDDs and uses LRU to keep the most relevant RDDs in memory automatically (which is how most caching works any way). It will be of great help for the developer as he does not have to think about caching and concentrate on the application. Also I do not see how can it negatively impact performance, as it is difficult to keep track of, how many times a variable (RDD) is used inside the program, most programmers will decide to cache most of the RDDs any way.

Caching usually happens automatically. Take the examples of either an OS/platform or a framework or a tool. But with the complexities of caching in distributed computing, I might be missing why caching cannot be automatic or the performance implications.

So I fail to understand, why I have to explicitly cache as,

It looks ugly
It can easily be missed
It can easily be over/under used

标签： apache-spark caching

1条回答

何必那么认真

2楼-- · 2020-01-26 10:24

A subjective list of reasons:

in practice caching is rarely needed and is useful mostly for iterative algorithms, breaking long lineages. For example typical ETL pipelines may not require caching at all. Cache most of the RDDs is definitely not the right choice.
there is no universal caching strategy. Actual choice depends on available recourses like amount of memory, disks (local, remote, storage service), file system (in-memory, on-disk) and particular application.
on-disk persistence is expensive, in memory persistence puts more stress on a JVM and is using the most valuable resource in Spark
it is impossible to cache automatically without making assumptions about application semantics. In particular:
- expected behavior when data source changes. There is no universal answer and in many situations it can be impossible to automatically track changes
- differentiating between deterministic and non-deterministic transformations and choosing between caching and re-computing
comparing Spark caching to OS level caching doesn't make sense. The main goal of OS caching is to reduce latency. In Spark latency is usually not the most important factor and caching is used for other purposes like consistency, correctness and reducing stress on different parts of the system.
if cache doesn't use off-heap storage than caching introduces additional pressure on the garbage collector. GC cost can be actually higher than a cost of recomputing the data.
depending on the data and caching method reading data from cache can be significantly less efficient memory-wise.
Caching interferes with more advanced optimizations available in Spark SQL, effectively disabling partition pruning or predicate and projection pushdown.

It is also worth to note that:

removing cached data is handled automatically using LRU
some data (like intermediate shuffle data) is persisted automatically. I acknowledge that it makes some of the previous arguments at least partially invalid.
Spark caching doesn't affect system level or JVM level mechanisms

0人赞添加讨论(0) 举报

Why do I have to explicitly tell Spark what to cac

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间