How to get an Iterator of Rows using Dataframe in

Posted 2019-02-17 11:31

I have a SparkSQL application that returns a large number of rows, too many to fit in memory, so I cannot use the collect function on the DataFrame. Is there a way to get all these rows as an Iterable instead of the entire result as a list?

Note: I am executing this SparkSQL application using yarn-client

Tags: apache-spark apache-spark-sql apache-spark-1.3

2 answers

趁早两清

#2 · 2019-02-17 11:46

Generally speaking, transferring all the data to the driver is a pretty bad idea, and most of the time there is a better solution. If you really want to do this, you can use the toLocalIterator method on an RDD:

val df: org.apache.spark.sql.DataFrame = ???
df.cache() // Optional, to avoid repeated computation; see docs for details
val iter: Iterator[org.apache.spark.sql.Row] = df.rdd.toLocalIterator
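
To make the snippet above more concrete, here is a minimal sketch of consuming such an iterator on the driver without materializing the whole result. The helper name foreachRowOnDriver and its handle parameter are hypothetical, introduced only for illustration; the only Spark calls used are df.rdd and RDD.toLocalIterator, and Spark is assumed to be on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helper (not part of Spark): applies `handle` to every row
// on the driver and returns how many rows were seen.
def foreachRowOnDriver(df: DataFrame)(handle: Row => Unit): Long = {
  // toLocalIterator fetches data partition by partition, so the driver
  // needs room for one partition at a time rather than the full result.
  val rows: Iterator[Row] = df.rdd.toLocalIterator
  var count = 0L
  rows.foreach { row =>
    handle(row)
    count += 1
  }
  count
}
```

A call might look like foreachRowOnDriver(df)(row => println(row)). Whether to call df.cache() first is left to the caller, since caching a result that is already too large for memory may not help.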


时光不老，我们不散

#3 · 2019-02-17 11:49

Actually, you can just use df.toLocalIterator (available since Spark 2.0.0, per the @since tag). Here is the reference in the Spark source code:

/**
 * Return an iterator that contains all of [[Row]]s in this Dataset.
 *
 * The iterator will consume as much memory as the largest partition in this Dataset.
 *
 * Note: this results in multiple Spark jobs, and if the input Dataset is the result
 * of a wide transformation (e.g. join with different partitioners), to avoid
 * recomputing the input Dataset should be cached first.
 *
 * @group action
 * @since 2.0.0
 */
def toLocalIterator(): java.util.Iterator[T] = withCallback("toLocalIterator", toDF()) { _ =>
  withNewExecutionId {
    queryExecution.executedPlan.executeToIterator().map(boundEnc.fromRow).asJava
  }
}
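
Note that this method returns a java.util.Iterator rather than a Scala Iterator. The following is a small sketch of wrapping it for use from Scala, assuming Spark 2.0 or later (where DataFrame is an alias for Dataset[Row]). The helper name localRows is hypothetical, and scala.collection.JavaConverters is the pre-2.13 Scala conversion API (it still compiles on 2.13, with a deprecation warning).

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helper (not part of Spark): exposes the rows of a DataFrame
// as a Scala Iterator backed by Dataset.toLocalIterator().
def localRows(df: DataFrame): Iterator[Row] =
  df.toLocalIterator().asScala // wraps the java.util.Iterator, no copying
```

With this in place, something like localRows(df).take(5).foreach(println) reads only the first few rows from the iterator. As the doc comment above warns, caching the input first can avoid recomputation when it is the result of a wide transformation.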

