How to calculate the number of rows of a dataframe

Posted 2019-08-29 12:07

Question:

This question already has an answer here:

  • Count on Spark Dataframe is extremely slow
  • Getting the count of records in a data frame quickly

I have a very large pyspark dataframe and I would like to calculate the number of rows, but the count() method is too slow. Is there any faster method?

Answer 1:

If you don't mind getting an approximate count, you could try sampling the dataset first and then scaling up by the inverse of your sampling fraction:

>>> df = spark.range(10)
>>> df.sample(0.5).count()
4

In this case, you would scale the count() result by 2 (i.e., 1/0.5). Obviously, there is a statistical error with this approach.
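
To make the scaling step concrete, here is a minimal sketch; the fraction of 0.1 and the DataFrame built with spark.range are illustrative choices, not part of the original answer:

# Minimal sketch of the sample-and-scale estimate described above.
# The fraction 0.1 is an arbitrary illustrative choice.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)  # stand-in for a large dataframe

fraction = 0.1
# sample() includes each row independently with the given probability,
# so the sampled size (and hence the estimate) varies between runs.
estimated_rows = df.sample(fraction).count() / fraction
print(int(estimated_rows))  # roughly 1000, subject to sampling error

The smaller the fraction, the faster the sampled count runs, but the larger the variance of the resulting estimate.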