I have a dataframe with as many as 10 million records. How can I get a count quickly? df.count
is taking a very long time.
Answer 1:
Counting is going to take significant time in any case, at least the first time, because Spark has to scan all the data. One option is to cache the DataFrame first; then you will be able to do more with it than just count, and later actions will be fast. E.g.:
df.cache()
df.count()
Subsequent operations don't take much time.
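A minimal sketch of the cache-then-count pattern described above (the SparkSession setup and the input path are assumptions for illustration, not from the original answer):

```scala
import org.apache.spark.sql.SparkSession

object CountExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; in a real job you would reuse the existing session.
    val spark = SparkSession.builder()
      .appName("count-example")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input path.
    val df = spark.read.parquet("/data/events.parquet")

    df.cache()          // mark the DataFrame for in-memory caching
    val n = df.count()  // first action scans the data and materializes the cache (slow)
    val m = df.count()  // second count is served from the cache (fast)

    println(s"rows: $n")
    spark.stop()
  }
}
```

The first `count()` still pays the full scan cost; caching only helps if you run more than one action on the same DataFrame.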
Answer 2:
file.groupBy("<column-name>").count().show()

Note that this returns a count per distinct value of the column, not the total row count.