calculating first quartile for a numeric column in Spark/Scala

Posted 2020-03-18 03:04

Question:

I'm a newbie in Spark/Scala. This is what I'm doing to calculate the first quartile of a CSV file:

// read the file, take the third comma-separated field, and convert it to Double
val column = sc.textFile("test.txt").map(_.split(",")(2)).flatMap(_.split(",")).map(_.toDouble)

// position of the first quartile using the (n + 1) / 4 rule
val total = column.count.toDouble
val upper = (total + 1) / 4
val upper2 = scala.math.ceil(upper).toInt

I'm not really sure how to sort the column other than by adding a key-value pair. All I need is to take the last two of the sorted values for the quartile, but I'm forced to create a key-value pair.

val quartiles = column.map((_, 1)).sortByKey(true).take(upper2)
// if the quartile position is fractional, take the value at the floor of that position;
// otherwise average the value at the position with the one before it
// (._1 drops the dummy key added for sortByKey)
val first_quartile =
  if (upper % upper.toInt > 0) {
    quartiles(upper.toInt - 1)._1
  } else {
    (quartiles(upper2 - 1)._1 + quartiles(upper2 - 2)._1) / 2
  }

This works, but it leaves me with an annoying key-value pair. How do I get back to just one column instead of two (i.e. without the key-value pair)?
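One workaround seems to be RDD.sortBy, which sorts the values directly so no key is needed; a minimal sketch, assuming the column, upper and upper2 values from above:

// sort the Doubles themselves; no (value, 1) pairing required
val sortedVals = column.sortBy(x => x).take(upper2)
val firstQuartile =
  if (upper % upper.toInt > 0) sortedVals(upper.toInt - 1)
  else (sortedVals(upper2 - 1) + sortedVals(upper2 - 2)) / 2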

Answer 1:

Was just doing this myself. I started out writing a function to compute the median, but found it was faster and easier to get quantiles by casting my RDD as a DataFrame and querying it with SQL. Here's a self-contained example:

  // construct example RDD
  val rows = Seq(3, 1, 5, 1, 9, 2, 2)
  val rdd = sc.parallelize(rows)

  // construct DataFrame (toDF needs the sqlContext implicits in scope)
  import sqlContext.implicits._
  case class MedianDF(value: Long)
  val df = rdd.map(row => MedianDF(row.toLong)).toDF

  // register the table and then query for your desired percentile
  df.registerTempTable("table")
  sqlContext.sql("SELECT PERCENTILE(value, 0.5) FROM table").show()

Which returns 2, the median. Similarly, if you want the first quartile, just pass 0.25 to PERCENTILE:

sqlContext.sql("SELECT PERCENTILE(value, 0.25) FROM table").show()
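If I remember right, the Hive percentile UDAF also accepts an array of percentiles, so several quantiles can be computed in a single query; a sketch under that assumption:

// first quartile, median, and third quartile in one pass
// (assumes the array form of the Hive percentile UDAF is available)
sqlContext.sql("SELECT PERCENTILE(value, array(0.25, 0.5, 0.75)) FROM table").show()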