How to take a random row from a PySpark DataFrame?

Posted 2020-05-23 03:07

Question:

How can I get a random row from a PySpark DataFrame? The only method I see is sample(), which takes a fraction as its parameter. Setting this fraction to 1/numberOfRows gives inconsistent results: sometimes I get no row at all.

On an RDD there is a takeSample() method that takes the number of elements you want the sample to contain as a parameter. I understand that this might be slow, as you have to count each partition, but is there a way to get something like this on a DataFrame?

Answer 1:

You can simply call takeSample on the DataFrame's underlying RDD:

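from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")], ("k", "v"))
# takeSample(withReplacement, num, seed) returns a Python list of Row objects
df.rdd.takeSample(False, 1, seed=0)
## [Row(k=3, v='c')]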

If you don't want to collect, you can simply sample a higher fraction and limit the result:

df.sample(False, 0.1, seed=0).limit(1)
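How high the fraction needs to be depends on the row count. The sketch below assumes that sampling without replacement keeps each row independently with probability equal to the fraction, which would explain why a fraction of 1/numberOfRows sometimes returns nothing. Under that assumption the chance of an empty result is (1 - fraction)^n: about 0.32 for a fraction of 1/4 on 4 rows, and about 0.66 for a fraction of 0.1 on 4 rows. The helper names `empty_probability` and `fraction_for` are hypothetical and not part of PySpark.

```python
def empty_probability(n_rows: int, fraction: float) -> float:
    """Chance that independent per-row sampling keeps no row at all."""
    return (1.0 - fraction) ** n_rows


def fraction_for(n_rows: int, max_empty_prob: float) -> float:
    """Smallest fraction whose chance of an empty sample is at most max_empty_prob."""
    return min(1.0, 1.0 - max_empty_prob ** (1.0 / n_rows))


n = 4  # row count of the example DataFrame above
print(empty_probability(n, 1.0 / n))  # fraction 1/numberOfRows, as in the question
print(empty_probability(n, 0.1))      # fraction used in the snippet above
print(fraction_for(n, 0.01))          # fraction giving at most a 1% chance of no row
```

A value from `fraction_for(df.count(), 0.01)` could then be passed to `df.sample(...)` before `.limit(1)`. Note that `df.count()` itself triggers a pass over the data, so this only pays off when the count is already known or cheap to obtain.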

Tags: python apache-spark dataframe pyspark apache-spark-sql
