I have a Spark application running on YARN. Given an RDD, I need to execute queries against a database. The problem is that I have to set the connection options properly, otherwise the database will be overloaded, and these options depend on the number of workers querying the database simultaneously. To solve this, I want to detect the current number of running workers at runtime (from within a worker). Something like this:
val totalDesiredQPS = 1000 //queries per second
val queries: RDD[String] = ???
queries.mapPartitions(it => {
  val dbClientForThisWorker = ...
  //TODO: get this information from YARN somehow
  val numberOfContainers = ???
  dbClientForThisWorker.setQPS(totalDesiredQPS / numberOfContainers)
  it.map(query => dbClientForThisWorker.executeAsync...)
  ....
})
I would also appreciate alternative solutions, but I want to avoid a shuffle and get almost full DB utilization regardless of the number of workers.
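To make the intended arithmetic concrete, here is a minimal sketch of how the global budget would be split once the worker count is known. The object `QpsBudget` and the function `perWorkerQps` are hypothetical names used only for illustration. The sketch does not solve the actual problem of obtaining the worker count.

```scala
object QpsBudget {
  // Hypothetical helper: split a global QPS budget evenly across workers.
  // Integer division means up to (numberOfWorkers - 1) QPS of the budget stay unused.
  def perWorkerQps(totalDesiredQps: Int, numberOfWorkers: Int): Int = {
    require(totalDesiredQps > 0, "totalDesiredQps must be positive")
    require(numberOfWorkers > 0, "numberOfWorkers must be positive")
    // Each worker gets at least 1 QPS; if there are more workers than QPS,
    // the combined rate will exceed the global budget.
    math.max(1, totalDesiredQps / numberOfWorkers)
  }
}
```

The missing piece is still the value of `numberOfWorkers`. One approximation I am aware of is reading `sc.getExecutorMemoryStatus.size - 1` on the driver (the map includes the driver itself) and capturing that number in the closure. That is only a snapshot taken before the job starts, so it may be stale if executors are added or lost.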