What does “RDDs can be stored in memory” mean in Spark?

Posted 2019-07-07 07:56

In the introduction to Spark, it says:

RDDs can be stored in memory between queries without requiring replication.

As far as I know, you must cache an RDD manually using .cache() or .persist(). If I do neither, as below:

   val file = sc.textFile("hdfs://data/kv1.txt")
   file.flatMap(line => line.split(" "))
   file.count()

I don't persist the RDD "file" in memory or on disk. In this case, can Spark still run faster than MapReduce?

2 Answers
forever°为你锁心
#2 · 2019-07-07 08:28

What will happen is that Spark will compute, partition by partition, each stage of the computation. It will hold some data temporarily in memory to do its work. It may have to spill data to disk and transfer across the network to execute some stages. But none of this is (necessarily) persistent. If you count() again it would start from scratch.

This is not a case where Spark would run faster than MapReduce; it would probably be slower for a simple operation like this. In fact, there is nothing about this that would benefit from loading into memory.

More complex examples, like with a non-trivial pipeline or repeated access to the RDD, would show a benefit from persisting in memory, or even on disk.
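To illustrate the difference, here is a minimal sketch contrasting repeated actions with and without persisting. It assumes a live SparkContext named `sc` and reuses the HDFS path from the question; it will not run outside a Spark session.

```scala
// Sketch only: assumes a running SparkContext `sc` and the asker's HDFS path.
import org.apache.spark.storage.StorageLevel

val words = sc.textFile("hdfs://data/kv1.txt")
  .flatMap(line => line.split(" "))

// Without persisting: each action re-reads the file and re-splits every line.
words.count()            // full computation
words.distinct().count() // full computation again, from scratch

// With persisting: the first action materializes the partitions in memory,
// and later actions reuse them instead of recomputing.
words.persist(StorageLevel.MEMORY_ONLY)
words.count()            // computes and fills the cache
words.distinct().count() // reuses the cached partitions
```

This is exactly the "repeated access to the RDD" case: the second pair of actions pays the read-and-split cost only once.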

等我变得足够好
#3 · 2019-07-07 08:51

Yes tonyking, it will no doubt run faster than MapReduce. Spark processes RDDs in memory, but each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep its elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.

http://spark.apache.org/docs/latest/programming-guide.html

"This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank"

The answer to your question, "What does 'RDDs can be stored in memory' mean in Spark?", is that we can store an RDD in RAM using .cache(), so it is not recomputed each time we apply an action to it.
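As a sketch of the storage options the answer mentions (in-memory, on-disk, replicated), again assuming a running SparkContext `sc`:

```scala
// Sketch only: assumes a running SparkContext `sc`.
import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs://data/kv1.txt")

// Keep deserialized partitions in RAM only; .cache() is shorthand for this:
rdd.persist(StorageLevel.MEMORY_ONLY)

// Other levels (pick ONE per RDD; changing the level afterwards throws an exception):
//   StorageLevel.MEMORY_AND_DISK  - spill partitions that don't fit in RAM to disk
//   StorageLevel.DISK_ONLY        - store partitions on disk only
//   StorageLevel.MEMORY_ONLY_2    - replicate each cached partition on two nodes

rdd.count()     // the first action materializes the cache
rdd.unpersist() // release the storage when no longer needed
```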
