What does “RDDs can be stored in memory” mean in Spark?

Posted 2019-07-07 07:56

In the introduction to Spark, it says:

RDDs can be stored in memory between queries without requiring replication.

As far as I know, you must cache an RDD manually using .cache() or .persist(). If I do neither, as below:

   val file = sc.textFile("hdfs://data/kv1.txt")
   file.flatMap(line => line.split(" "))
   file.count()

I don't persist the RDD "file" in memory or on disk. In this case, can Spark still run faster than MapReduce?

2 Answers
forever°为你锁心
#2 · 2019-07-07 08:28

What will happen is that Spark will compute, partition by partition, each stage of the computation. It will hold some data temporarily in memory to do its work. It may have to spill data to disk and transfer across the network to execute some stages. But none of this is (necessarily) persistent. If you count() again it would start from scratch.

This is not a case where Spark would run faster than MapReduce; it would probably be slower for a simple operation like this. In fact, there is nothing about this that would benefit from loading into memory.

More complex examples, like with a non-trivial pipeline or repeated access to the RDD, would show a benefit from persisting in memory, or even on disk.
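To illustrate the difference, here is a minimal sketch contrasting repeated actions with and without persisting. It assumes a live SparkContext named `sc` and reuses the HDFS path from the question; it will not run outside a Spark session.

```scala
// Sketch only: assumes a running SparkContext `sc` and the asker's HDFS path.
import org.apache.spark.storage.StorageLevel

val words = sc.textFile("hdfs://data/kv1.txt")
  .flatMap(line => line.split(" "))

// Without persisting: each action re-reads the file and re-splits every line.
words.count()            // full computation
words.distinct().count() // full computation again, from scratch

// With persisting: the first action materializes the partitions in memory,
// and later actions reuse them instead of recomputing.
words.persist(StorageLevel.MEMORY_ONLY)
words.count()            // computes and fills the cache
words.distinct().count() // reuses the cached partitions
```

This is exactly the "repeated access to the RDD" case: the second pair of actions pays the read-and-split cost only once.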

等我变得足够好
#3 · 2019-07-07 08:51

Yes tonyking, it will no doubt run faster than MapReduce. Spark processes RDDs in memory, but each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep its elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.

http://spark.apache.org/docs/latest/programming-guide.html

"This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank"

The answer to your question, "What does 'RDDs can be stored in memory' mean in Spark?", is that we can store an RDD in RAM using .cache(), so it is not recomputed each time we apply an action to it.
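As a sketch of the storage options the answer mentions (in-memory, on-disk, replicated), again assuming a running SparkContext `sc`:

```scala
// Sketch only: assumes a running SparkContext `sc`.
import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs://data/kv1.txt")

// Keep deserialized partitions in RAM only; .cache() is shorthand for this:
rdd.persist(StorageLevel.MEMORY_ONLY)

// Other levels (pick ONE per RDD; changing the level afterwards throws an exception):
//   StorageLevel.MEMORY_AND_DISK  - spill partitions that don't fit in RAM to disk
//   StorageLevel.DISK_ONLY        - store partitions on disk only
//   StorageLevel.MEMORY_ONLY_2    - replicate each cached partition on two nodes

rdd.count()     // the first action materializes the cache
rdd.unpersist() // release the storage when no longer needed
```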
