collect on a Spark DataFrame

Published 2019-08-18 07:18

Question:

I wrote this:

df.select(col("colname")).distinct().collect.map(_.toString()).toList

the result is

List("[2019-06-24]", "[2019-06-22]", "[2019-06-23]")

Whereas I want to get :

List("2019-06-24", "2019-06-22", "2019-06-23")

How can I change this, please?

Answer 1:

You need to change .map(_.toString()) to .map(_.getAs[String]("colname")).

With .map(_.toString()), you are calling org.apache.spark.sql.Row.toString, which renders each row in bracketed form; that is why the output looks like List("[2019-06-24]", "[2019-06-22]", "[2019-06-23]").

The correct way is:
val list = df.select("colname").distinct().collect().map(_.getAs[String]("colname")).toList

Output will be:

List("2019-06-24", "2019-06-22", "2019-06-23")
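If the column is a string column, you can also skip Row extraction entirely by converting to a typed Dataset with .as[String]. This is a sketch, assuming a SparkSession named spark is in scope and "colname" holds strings:

```scala
// Assumes an active SparkSession called `spark` and a DataFrame `df`
// whose "colname" column is of string type.
import spark.implicits._

// .as[String] yields a Dataset[String], so collect() returns the raw
// values directly instead of Row objects.
val list: List[String] =
  df.select("colname").distinct().as[String].collect().toList
```

This avoids .getAs[String] altogether, at the cost of requiring the column type to match String at runtime.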


Answer 2:

Sample data:

val df = sc.parallelize(Seq("2019-06-24", "2019-06-22", "2019-06-23")).toDF("cn")

Now select the column, map each Row to its value at index 0, then wrap each value in literal quote characters:

df.select("cn").collect().map(x => x(0)).map(x => s""""$x"""")
//res36: Array[String] = Array("2019-06-24", "2019-06-22", "2019-06-23")

(or)

df.select("cn").collect().map(x => x(0)).map(x => s""""$x"""").toList
//res37: List[String] = List("2019-06-24", "2019-06-22", "2019-06-23")
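Note that the snippet above embeds literal quote characters inside each string. If you only want the plain values (as in the accepted answer), Row.getString(0) extracts the value at index 0 directly without any quote-wrapping. A minimal sketch, assuming the same DataFrame df with string column "cn":

```scala
// Row.getString(i) returns the String at column index i of each Row,
// so no string interpolation or manual quoting is needed.
val list: List[String] =
  df.select("cn").distinct().collect().map(_.getString(0)).toList
```

The quotes seen in the REPL output of a List[String] are just how the REPL displays strings; they are not part of the values themselves.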