This question already has answers here:
- Count on Spark Dataframe is extremely slow (2 answers)
- Getting the count of records in a data frame quickly (2 answers)
I have a very large pyspark dataframe and I would like to calculate the number of rows, but the count() method is too slow. Is there any faster method?
If you don't mind getting an approximate count, you could try sampling the dataset first and then scaling by your sampling factor:
>>> df = spark.range(10)
>>> df.sample(0.5).count()
4
In this case, you would scale the count() result by 2 (i.e. 1/0.5). Obviously, there is some statistical error with this approach.
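As a rough sketch of the scaled estimate (assuming df is your large DataFrame; the 10% sampling fraction here is arbitrary, so tune it to your data size):

>>> fraction = 0.1                                    # assumed sampling fraction
>>> approx_count = df.sample(fraction).count() / fraction   # scale up by 1/fraction

Smaller fractions finish faster but increase the statistical error, so pick the largest fraction you can afford to count.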