So I am trying to get data from a MySQL database using Spark within a Play/Scala project. Since the number of rows I am trying to retrieve is huge, my aim is to get an Iterator from the Spark RDD. Here is the Spark context and configuration:
import org.apache.spark.{SparkConf, SparkContext}

private val configuration = new SparkConf()
  .setAppName("Reporting")
  .setMaster("local[*]")
  .set("spark.executor.memory", "2g")
  .set("spark.akka.timeout", "5")
  .set("spark.driver.allowMultipleContexts", "true")

val sparkContext = new SparkContext(configuration)
The JdbcRDD is as follows, along with the SQL query:
val query =
"""
|SELECT id, date
|FROM itembid
|WHERE date BETWEEN ? AND ?
""".stripMargin
val rdd = new JdbcRDD[ItemLeadReportOutput](
  SparkProcessor.sparkContext,
  driverFactory,     // () => java.sql.Connection
  query,
  rangeMinValue.get, // lower bound, bound to the first ?
  rangeMaxValue.get, // upper bound, bound to the second ?
  partitionCount,    // number of partitions to split the range into
  rowMapper)         // ResultSet => ItemLeadReportOutput
  .persist(StorageLevel.MEMORY_AND_DISK)
The data is too large to fetch all at once. At the beginning, with smaller data sets, it was possible to get an iterator from rdd.toLocalIterator. However, in this specific case it cannot compute an iterator. So my aim is to have multiple partitions and receive the data part by part. I keep getting errors. What is the correct way of doing this?
I believe that you are facing a heap problem reading your MySQL table.
What I'd do in your case is fetch the data from MySQL into files on a storage system (HDFS or local disk), and then use Spark's textFile to read it back!
Example:
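Here is a minimal sketch of the export step. The connection URL, credentials, date range and output path are all assumptions you will need to adapt; the query mirrors the one in the question:

import java.io.FileWriter
import java.sql.DriverManager
import com.opencsv.CSVWriter

// Hypothetical URL, credentials, date range and output path -- adjust to your setup.
val connection = DriverManager.getConnection(
  "jdbc:mysql://localhost:3306/reporting", "user", "password")
// NO_QUOTE_CHARACTER keeps the output trivially splittable on ',' later.
val writer = new CSVWriter(
  new FileWriter("/tmp/itembid.csv"), ',', CSVWriter.NO_QUOTE_CHARACTER)
try {
  val statement = connection.createStatement()
  val resultSet = statement.executeQuery(
    "SELECT id, date FROM itembid WHERE date BETWEEN '2016-01-01' AND '2016-12-31'")
  while (resultSet.next()) {
    writer.writeNext(Array(resultSet.getString("id"), resultSet.getString("date")))
  }
} finally {
  writer.close()
  connection.close()
}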
Once your data is stored, you can read it back with textFile:
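Something along these lines; the path matches the sketch above, and ItemLeadReportOutput(id, date) is a placeholder for your actual row mapping:

// Read the exported CSV; each HDFS/local split becomes one partition.
val lines = sparkContext.textFile("/tmp/itembid.csv")

val reports = lines.map { line =>
  val Array(id, date) = line.split(",", -1)
  ItemLeadReportOutput(id, date) // assumption: adapt to your case class
}

// With the data in proper on-disk partitions, toLocalIterator can stream
// it to the driver one partition at a time instead of collecting it all.
val it: Iterator[ItemLeadReportOutput] = reports.toLocalIterator
it.foreach(println)

The point of the detour through files is that textFile gives you partitions sized by block, so the driver only ever needs to hold one partition in memory at a time.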
PS: I've used some shortcuts (CSVWriter, for example) in the code, but you can use it as a skeleton for what you are intending to do!