I have a Spark SQL application that returns a large number of rows which are very difficult to fit in memory, so I cannot call collect on the DataFrame. Is there a way to get all these rows as an Iterable instead of the entire result as a list?
Note: I am executing this Spark SQL application in yarn-client mode.
Generally speaking, transferring all the data to the driver looks like a pretty bad idea, and most of the time there is a better solution out there, but if you really want to go with this you can use the toLocalIterator method on an RDD:
val df: org.apache.spark.sql.DataFrame = ???
df.cache // Optional, to avoid repeated computation, see docs for details
val iter: Iterator[org.apache.spark.sql.Row] = df.rdd.toLocalIterator
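The rows can then be processed one at a time on the driver; partitions are fetched one by one, so memory usage is bounded by the largest partition rather than the whole result. A minimal sketch, assuming a hypothetical processRow function for whatever per-row work you need:

// Consume the iterator incrementally instead of materializing all rows.
iter.foreach { row =>
  processRow(row) // hypothetical sink: write to a file, a database, etc.
}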
Actually, you can just use df.toLocalIterator directly; here is the relevant code in the Spark source:
/**
* Return an iterator that contains all of [[Row]]s in this Dataset.
*
* The iterator will consume as much memory as the largest partition in this Dataset.
*
* Note: this results in multiple Spark jobs, and if the input Dataset is the result
* of a wide transformation (e.g. join with different partitioners), to avoid
* recomputing the input Dataset should be cached first.
*
* @group action
* @since 2.0.0
*/
def toLocalIterator(): java.util.Iterator[T] = withCallback("toLocalIterator", toDF()) { _ =>
  withNewExecutionId {
    queryExecution.executedPlan.executeToIterator().map(boundEnc.fromRow).asJava
  }
}
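Since this returns a java.util.Iterator[Row], in Scala you would typically wrap it with asScala. A minimal sketch, assuming Spark 2.0+ (the cache call follows the note in the docs above about avoiding recomputation):

import scala.collection.JavaConverters._

val df: org.apache.spark.sql.DataFrame = ???
df.cache() // avoid recomputing the input across the multiple jobs mentioned above

// Dataset.toLocalIterator() returns a java.util.Iterator[Row]; asScala wraps it
// as a Scala Iterator so it can be used with the usual collection methods.
val rows: Iterator[org.apache.spark.sql.Row] = df.toLocalIterator().asScala
rows.foreach(println)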