How to calculate the number of rows of a dataframe

Posted 2019-08-29 12:07

Question:

This question already has an answer here:

  • Count on Spark Dataframe is extremely slow
  • Getting the count of records in a data frame quickly

I have a very large pyspark dataframe and I would like to calculate the number of rows, but the count() method is too slow. Is there any faster method?

Answer 1:

If you don't mind getting an approximate count, you could try sampling the dataset first and then scaling up by the inverse of your sampling fraction:

>>> df = spark.range(10)
>>> df.sample(0.5).count()
4

In this case, you would scale the count() result by 2 (i.e., 1/0.5). Obviously, there is a statistical error with this approach.
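
To make the scaling step concrete, here is a minimal sketch; the fraction of 0.1 and the DataFrame built with spark.range are illustrative choices, not part of the original answer:

# Minimal sketch of the sample-and-scale estimate described above.
# The fraction 0.1 is an arbitrary illustrative choice.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)  # stand-in for a large dataframe

fraction = 0.1
# sample() includes each row independently with the given probability,
# so the sampled size (and hence the estimate) varies between runs.
estimated_rows = df.sample(fraction).count() / fraction
print(int(estimated_rows))  # roughly 1000, subject to sampling error

The smaller the fraction, the faster the sampled count runs, but the larger the variance of the resulting estimate.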