Get current number of running containers in Spark

Published 2019-08-19 07:37

Question:

I have a Spark application running on top of YARN. Given an RDD, I need to execute a query against a database for each element. The problem is that I have to set the connection options properly, otherwise the database will be overloaded, and these options depend on the number of workers querying the DB simultaneously. To solve this I want to detect the current number of running workers at runtime (from within a worker), something like this:

val totalDesiredQPS = 1000 // queries per second
val queries: RDD[String] = ???
queries.mapPartitions { it =>
  val dbClientForThisWorker = ??? // construct the DB client for this worker
  //TODO: get this information from YARN somehow
  val numberOfContainers = ???
  dbClientForThisWorker.setQPS(totalDesiredQPS / numberOfContainers)
  it.map(query => dbClientForThisWorker.executeAsync(query))
}

I would also appreciate alternative solutions, but I want to avoid a shuffle and get close to full DB utilization no matter how many workers there are.
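One approach I have considered (a sketch, not a confirmed solution): instead of asking YARN from inside the worker, take a snapshot of the executor count on the driver via `SparkContext.getExecutorMemoryStatus` before submitting the job, and close over it in `mapPartitions`. Note the caveats in the comments: the status map includes the driver itself, and with dynamic allocation the count can go stale. The `dbClient` here is the same hypothetical client as in the snippet above.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Snapshot taken on the driver at job-submission time.
// getExecutorMemoryStatus also lists the driver, hence the "- 1";
// math.max guards against a zero count in local mode.
// With dynamic allocation this number can change after the snapshot.
val sc: SparkContext = ???
val totalDesiredQPS = 1000
val queries: RDD[String] = ???

val numExecutors = math.max(sc.getExecutorMemoryStatus.size - 1, 1)
val qpsPerExecutor = totalDesiredQPS / numExecutors

queries.mapPartitions { it =>
  val dbClient = ??? // hypothetical DB client, as in the question
  dbClient.setQPS(qpsPerExecutor)
  it.map(query => dbClient.executeAsync(query))
}
```

The value `qpsPerExecutor` is a plain `Int`, so it is serialized into the closure and available on every executor without any extra shuffle; the trade-off is that it reflects the cluster state at submission time rather than the live container count.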