Getting the count of records in a DataFrame quickly

Posted 2020-04-02 02:50

I have a DataFrame with as many as 10 million records. How can I get the record count quickly? df.count is taking a very long time.

Tags: scala apache-spark hadoop-streaming

2 answers

Reply #2 · 2020-04-02 03:25

file.groupBy("<column-name>").count().show()

Reply #3 · 2020-04-02 03:37

The count is going to take a long time anyway, at least the first time, because Spark has to read the data to count it.

One option is to cache the DataFrame, so that later operations on it, not just the count, can reuse the cached data.

E.g.:

df.cache()
df.count()

Subsequent operations on df then take much less time, because they read from the cache instead of recomputing it.
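To see the effect, here is a minimal sketch that times two consecutive counts on a cached DataFrame. It assumes a local SparkSession, and the DataFrame built with spark.range is only a hypothetical stand-in for the real 10-million-record data; the object name CachedCount and the timed helper are made up for illustration. Note that cache() is lazy: it only marks the DataFrame for caching, and the first action (here the first count) is what actually fills the cache.

```scala
import org.apache.spark.sql.SparkSession

object CachedCount {
  // Hypothetical helper: runs a block, prints how long it took, returns its result.
  def timed[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    val result = body
    println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimenting; on a cluster the master
    // is normally supplied by spark-submit instead.
    val spark = SparkSession.builder()
      .appName("cached-count")
      .master("local[*]")
      .getOrCreate()

    // Stand-in for the real data: 10 million rows with a single "id" column.
    val df = spark.range(0L, 10000000L).toDF("id")

    // Lazy: nothing is computed or stored yet.
    df.cache()

    // The first count computes the DataFrame and fills the cache.
    val first = timed("first count (fills the cache)")(df.count())

    // The second count can be served from the cached data.
    val second = timed("second count (from the cache)")(df.count())

    println(s"rows: $first (first), $second (second)")

    // Release the cached data once it is no longer needed.
    df.unpersist()
    spark.stop()
  }
}
```

Caching only pays off if the DataFrame is reused afterwards. If the count is the only thing needed, caching adds memory pressure without saving any work.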
