Environment
- Scala
- Apache Spark: 2.2.1
- EMR on AWS: emr-5.12.1
Content
I have one large DataFrame, like below:
val df = spark.read.option("basePath", "s3://some_bucket/").json("s3://some_bucket/group_id=*/")
There are ~1 TB of JSON files at s3://some_bucket, split into 5000 partitions of group_id.
I want to run a conversion using SparkSQL, and the conversion differs for each group_id.
The Spark code looks like this:
// Create view
val df = spark.read.option("basePath", "s3://data_lake/").json("s3://data_lake/group_id=*/")
df.createOrReplaceTempView("lakeView")
// One of the queries looks like this:
// SELECT
// col1 as userId,
// col2 as userName,
// .....
// FROM
// lakeView
// WHERE
// group_id = xxx;
val queries: Seq[String] = getGroupIdMapping
// ** Want to know better ways **
queries.par.foreach { query =>
  val convertedDF: DataFrame = spark.sql(query)
  convertedDF.write.save("s3://another_bucket/")
}
The par call parallelizes with up to Runtime.getRuntime.availableProcessors threads, which equals the number of cores on the driver.
But it seems awkward and not efficient enough, because this parallelism has nothing to do with Spark's own parallelization.
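For what it's worth, the thread count of a parallel collection can at least be tuned explicitly rather than left at the default. A minimal sketch, assuming Scala 2.11 (the version shipped with Spark 2.2.1) and an arbitrary pool size of 8:

import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

// Give the parallel collection its own pool instead of the default one,
// which is sized by Runtime.getRuntime.availableProcessors.
val parQueries = queries.par
parQueries.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8)) // pool size is an assumption

parQueries.foreach { query =>
  spark.sql(query).write.save("s3://another_bucket/")
}

This only changes how many queries the driver submits concurrently; each spark.sql job still runs on the cluster's executors as usual.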
I really want to do something like groupBy in scala.collection.Seq.
This is not valid Spark code:
df.groupBy(groupId).foreach((groupId, parDF) => {
  parDF.createOrReplaceTempView("lakeView")
  val convertedDF: DataFrame = spark.sql(queryByGroupId)
  convertedDF.write.save("s3://another_bucket")
})
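For reference, the intent of that pseudocode can be written with supported DataFrame operations by collecting the distinct group_id values to the driver and filtering per group. A minimal sketch, assuming group_id is a Long and a hypothetical queryByGroupId: Long => String mapping; it processes the groups sequentially, so it illustrates the semantics rather than answering the parallelism question:

import org.apache.spark.sql.DataFrame
import spark.implicits._

// 5000 distinct ids is small enough to collect to the driver.
val groupIds: Array[Long] = df.select("group_id").distinct().as[Long].collect()

groupIds.foreach { groupId =>
  // Restrict the view to a single group, then run that group's query.
  df.filter(df("group_id") === groupId).createOrReplaceTempView("lakeView")
  val convertedDF: DataFrame = spark.sql(queryByGroupId(groupId))
  convertedDF.write.save(s"s3://another_bucket/group_id=$groupId/") // per-group output path is an assumption
}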