How do I select the item with the highest count in a DataFrame?

Published 2019-08-21 17:58

Question:

For example, given the dataframe below, I want to do something like val top_src_ip = "58.242.83.11", but I don't want to hard-code this value. I want it to be a variable derived from the dataframe. What is the command to do that?

+--------------+------------+
|        src_ip|src_ip_count|
+--------------+------------+
|  58.242.83.11|          52|
|58.218.198.160|          33|
|58.218.198.175|          22|
|221.194.47.221|           6|
+--------------+------------+

Answer 1:

As in my answer here, you can use an argmax-style aggregation (max over a struct) to get the relevant value:

import org.apache.spark.sql.functions._
val newDF = df.agg(max(struct('src_ip_count, 'src_ip)) as 'tmp).select($"tmp.src_ip")

The above produces the result as a dataframe. To use it as a variable, simply take the head (there will be exactly one row) and extract the relevant column (I assume src_ip is a string):

val top_src_ip = newDF.head.getAs[String](0)
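Putting it together, here is a minimal end-to-end sketch, assuming a local SparkSession and rebuilding the question's sample data in memory (the object and app names are illustrative, not from the original):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TopSrcIp {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("top-src-ip")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question
    val df = Seq(
      ("58.242.83.11", 52),
      ("58.218.198.160", 33),
      ("58.218.198.175", 22),
      ("221.194.47.221", 6)
    ).toDF("src_ip", "src_ip_count")

    // argmax via max over a struct: structs compare field-by-field starting
    // with the first field, so max picks the row with the largest src_ip_count
    val newDF = df
      .agg(max(struct($"src_ip_count", $"src_ip")) as "tmp")
      .select($"tmp.src_ip")

    val top_src_ip = newDF.head.getAs[String](0)
    println(top_src_ip) // 58.242.83.11

    // Equivalent alternative: sort descending and take the first row
    // val topAlt = df.orderBy($"src_ip_count".desc).head.getAs[String]("src_ip")

    spark.stop()
  }
}
```

The struct trick avoids a full sort: the aggregation runs in a single pass, whereas orderBy would shuffle the whole dataframe just to read one row.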