How to share data from a Spark RDD between two applications

Posted 2019-02-18 18:31

What is the best way to share Spark RDD data between two Spark jobs?

I have a case where Job 1, a Spark sliding-window streaming app, consumes data at regular intervals and creates an RDD. We do not want to persist this RDD to storage.

Job 2 is a query job that accesses the RDD created by Job 1 and generates reports.

I have seen a few questions where Spark Job Server was suggested, but as it is an open-source project I am not sure whether it is a viable solution for us; any pointers would be a great help.

Thank you!

3 answers
倾城 Initia
#2 · 2019-02-18 18:58

According to the official documentation:

Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying the same RDDs.

http://spark.apache.org/docs/latest/job-scheduling.html
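A minimal sketch of that recommendation, where "Job 2" becomes a query handler inside the same driver process as the streaming job. Everything here is illustrative: the socket source on localhost:9999, the window sizes, and the names SharedRddServer and topWords are assumptions, not part of the question's setup.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SharedRddServer {
  // The latest windowed RDD, visible both to the streaming side and to
  // whatever query handler runs inside this same application.
  @volatile private var latestWindow: Option[RDD[(String, Long)]] = None

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("shared-rdd-server"), Seconds(10))

    ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split("\\s+"))
      .map((_, 1L))
      .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))
      .foreachRDD { rdd =>
        rdd.cache()                          // keep the new window in memory
        latestWindow.foreach(_.unpersist())  // release the previous window
        latestWindow = Some(rdd)
      }

    // "Job 2" logic: a request handler (e.g. behind an HTTP endpoint)
    // would call something like this against the shared reference.
    def topWords(n: Int): Array[(String, Long)] =
      latestWindow
        .map(_.top(n)(Ordering.by[(String, Long), Long](_._2)))
        .getOrElse(Array.empty)

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The key point is that both "jobs" run in one driver process, so they share a single SparkContext and can therefore query the same in-memory RDDs.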

戒情不戒烟
#3 · 2019-02-18 19:09

You can share RDDs across different applications using Apache Ignite. Apache Ignite provides an abstraction through which applications can access RDDs created by other applications. In addition, Ignite supports SQL indexes, which native Spark does not. Please refer to https://ignite.apache.org/features/igniterdd.html for more details.
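A minimal sketch of what that looks like, assuming the ignite-spark module is on the classpath of both applications; the cache name "sharedData" and the sample pairs are placeholders:

```scala
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
import org.apache.spark.{SparkConf, SparkContext}

// Application A: publish an RDD into an Ignite cache.
val sc = new SparkContext(new SparkConf().setAppName("ignite-writer"))
val ic = new IgniteContext(sc, () => new IgniteConfiguration())

// "sharedData" is an illustrative cache name; both apps must agree on it.
val shared: IgniteRDD[String, Long] = ic.fromCache("sharedData")
shared.savePairs(sc.parallelize(Seq(("a", 1L), ("b", 2L))))

// Application B (a completely separate Spark app) attaches to the same cache:
//   val ic2    = new IgniteContext(sc2, () => new IgniteConfiguration())
//   val mirror = ic2.fromCache[String, Long]("sharedData")
//   mirror.filter(_._2 > 1L).collect()
```

Because the cache lives in the Ignite cluster rather than inside either Spark application, the data outlives the writer's job and can be read by any number of consumers.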

神经病院院长
#4 · 2019-02-18 19:17

The short answer is that you can't share RDDs between jobs. The only way to share data is to write it out to HDFS and then pull it in from the other job. If speed is an issue and you want to maintain a constant stream of data, you can use HBase, which allows very fast access and processing from the second job.
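A rough sketch of that handoff using RDD object files; the HDFS path and the sample data are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Job 1: serialize the windowed RDD to a location both jobs can reach.
val sc = new SparkContext(new SparkConf().setAppName("producer"))
val window = sc.parallelize(Seq(("a", 1L), ("b", 2L)))  // stand-in for the window RDD
window.saveAsObjectFile("hdfs:///shared/latest-window")  // illustrative path

// Job 2 (separate application, separate SparkContext) pulls it back:
//   val restored = sc2.objectFile[(String, Long)]("hdfs:///shared/latest-window")
//   restored.reduceByKey(_ + _).collect()
```

Note that in a real streaming job each window would need a fresh (or cleaned-up) path, since saveAsObjectFile fails if the output directory already exists.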

To get a better idea, you should look here:

Serializing RDD
