How to groupby and aggregate multiple fields using

I am new to both Apache Spark and Scala and am currently learning them for big data work. I have a sample file, and for a given field I want to compute the total of a second field, its count, and the list of values from a third field. I tried this on my own with Spark RDDs (as a starting point), but I suspect my approach is not a good one.

Here is the sample data (Customerid: Int, Orderid: Int, Amount: Float):

44,8602,37.19
35,5368,65.89
2,3391,40.64
47,6694,14.98
29,680,13.08
91,8900,24.59
70,3959,68.68
85,1733,28.53
53,9900,83.55
14,1505,4.32
51,3378,19.80
42,6926,57.77
2,4424,55.77
79,9291,33.17
50,3901,23.57
20,6633,6.49
15,6148,65.53
44,8331,99.19
5,3505,64.18
48,5539,32.42

My current code:

((sc.textFile("file://../customer-orders.csv").map(x => x.split(",")).map(x => (x(0).toInt,x(1).toInt)).map{case(x,y) => (x, List(y))}.reduceByKey(_ ++ _).sortBy(_._1,true)).
fullOuterJoin(sc.textFile("file://../customer-orders.csv").map(x =>x.split(",")).map(x => (x(0).toInt,x(2).toFloat)).reduceByKey((x,y) => (x + y)).sortBy(_._1,true))).
fullOuterJoin(sc.textFile("file://../customer-orders.csv").map(x =>x.split(",")).map(x => (x(0).toInt)).map(x => (x,1)).reduceByKey((x,y) => (x + y)).sortBy(_._1,true)).sortBy(_._1,true).take(50).foreach(println)

I get a result like this:

(49,(Some((Some(List(8558, 6986, 686....)),Some(4394.5996))),Some(96)))

I expect a result like:

customerid, (orderids,..,..,....), totalamount, number of orderids

Is there a better approach? I also tried combineByKey with the code below, but the println calls inside it do not print anything.

scala> val reduced = inputrdd.combineByKey(
 | (mark) => {
 | println(s"Create combiner -> ${mark}")
 | (mark, 1)
 | },
 | (acc: (Int, Int), v) => {
 | println(s"""Merge value : (${acc._1} + ${v}, ${acc._2} + 1)""")
 | (acc._1 + v, acc._2 + 1)
 | },
 | (acc1: (Int, Int), acc2: (Int, Int)) => {
 | println(s"""Merge Combiner : (${acc1._1} + ${acc2._1}, ${acc1._2} + ${acc2._2})""")
 | (acc1._1 + acc2._1, acc1._2 + acc2._2)
 | }
 | )
reduced: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ShuffledRDD[27] at combineByKey at <console>:29

scala> reduced.collect()
res5: Array[(String, (Int, Int))] = Array((maths,(110,2)), (physics,(214,3)), (english,(65,1)))

I am using Spark 2.2.0, Scala 2.11.8, and Java 1.8 (build 101).

Tags: scala apache-spark group-by rdd apache-spark-mllib

1 answer

beautiful°

#2 · 2019-06-07 01:14

This is much easier to solve using the newer DataFrame API. First read the csv file and add the column names:

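// inferSchema makes the columns numeric; without it every column is read as a string
val df = spark.read.option("inferSchema", "true").csv("file://../customer-orders.csv").toDF("Customerid", "Orderid", "Amount")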

Then use groupBy and agg to compute the aggregations (here you want collect_list, sum and count from org.apache.spark.sql.functions):

val df2 = df.groupBy("Customerid").agg(
    collect_list($"Orderid") as "Orderids", 
    sum($"Amount") as "TotalAmount",
    count($"Orderid") as "NumberOfOrderIds"
)

Resulting dataframe using the provided input example:

+----------+------------+-----------+----------------+
|Customerid|    Orderids|TotalAmount|NumberOfOrderIds|
+----------+------------+-----------+----------------+
|        51|      [3378]|       19.8|               1|
|        15|      [6148]|      65.53|               1|
|        29|       [680]|      13.08|               1|
|        42|      [6926]|      57.77|               1|
|        85|      [1733]|      28.53|               1|
|        35|      [5368]|      65.89|               1|
|        47|      [6694]|      14.98|               1|
|         5|      [3505]|      64.18|               1|
|        70|      [3959]|      68.68|               1|
|        44|[8602, 8331]|     136.38|               2|
|        53|      [9900]|      83.55|               1|
|        48|      [5539]|      32.42|               1|
|        79|      [9291]|      33.17|               1|
|        20|      [6633]|       6.49|               1|
|        14|      [1505]|       4.32|               1|
|        91|      [8900]|      24.59|               1|
|         2|[3391, 4424]|      96.41|               2|
|        50|      [3901]|      23.57|               1|
+----------+------------+-----------+----------------+
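For running the same steps outside spark-shell, here is a minimal self-contained sketch. It uses a few in-memory rows from the sample data instead of the CSV file; the object name, the `aggregate` helper, the `local[*]` master and the app name are assumptions made for illustration only. Note that collect_list does not guarantee the order of the collected order ids.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, count, sum}

// Hypothetical wrapper object; not part of the original answer.
object GroupByAggSketch {

  // Groups by customer and returns (customerId, orderIds, totalAmount, orderCount).
  def aggregate(spark: SparkSession): Array[(Int, Seq[Int], Double, Long)] = {
    import spark.implicits._

    // A few rows from the sample data, typed as (Int, Int, Double).
    val df = Seq(
      (44, 8602, 37.19),
      (2, 3391, 40.64),
      (2, 4424, 55.77),
      (44, 8331, 99.19),
      (5, 3505, 64.18)
    ).toDF("Customerid", "Orderid", "Amount")

    val df2 = df.groupBy("Customerid").agg(
      collect_list($"Orderid") as "Orderids",
      sum($"Amount") as "TotalAmount",
      count($"Orderid") as "NumberOfOrderIds"
    )

    // sum over a Double column yields Double; count yields Long.
    df2.as[(Int, Seq[Int], Double, Long)].collect().sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    // Assumed local setup; in spark-shell the `spark` session already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("groupby-agg-sketch")
      .getOrCreate()

    // Print in the shape asked for in the question:
    // customerid, (orderids,...), totalamount, number of orderids
    aggregate(spark).foreach { case (id, orders, total, n) =>
      println(f"$id, (${orders.mkString(",")}), $total%.2f, $n")
    }
    spark.stop()
  }
}
```

Keeping the aggregation in its own method makes it possible to check the result without parsing printed output.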

If you want to work with the data as a RDD after these transformations, you can convert it afterwards:

// Assumes numeric columns (e.g. read with inferSchema); sum yields Double and count yields Long
val rdd = df2.as[(Int, Seq[Int], Double, Long)].rdd

Of course, it is possible to solve using RDDs directly as well. Use aggregateByKey:

val rdd = spark.sparkContext
  .textFile("test.csv")
  .map(x => x.split(","))
  .map(x => (x(0).toInt, (x(1).toInt, x(2).toFloat)))

val res = rdd.aggregateByKey((Seq[Int](), 0.0, 0))(
    (acc, xs) => (acc._1 ++ Seq(xs._1), acc._2 + xs._2, acc._3 + 1), 
    (acc1, acc2) => (acc1._1 ++ acc2._1, acc1._2 + acc2._2, acc1._3 + acc2._3))

This is harder to read but will give the same result as the dataframe approach above.
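To make the two functions passed to aggregateByKey easier to follow, here is a sketch in plain Scala collections, with no Spark involved. It only mimics the idea: the first function folds each value into a per-partition accumulator, and the second merges accumulators for the same key across partitions. The object name, the helper names (`seqOp`, `combOp`, `aggregate`) and the hand-made "partitions" are all hypothetical, and real Spark decides partitioning and merge order itself.

```scala
// Hypothetical illustration; it imitates aggregateByKey with ordinary Scala collections.
object AggregateByKeySketch {
  // Accumulator: (order ids, total amount, number of orders)
  type Acc = (Seq[Int], Double, Int)
  val zero: Acc = (Seq.empty[Int], 0.0, 0)

  // Folds one (orderId, amount) value into a partition-local accumulator.
  def seqOp(acc: Acc, v: (Int, Float)): Acc =
    (acc._1 :+ v._1, acc._2 + v._2, acc._3 + 1)

  // Merges two accumulators that belong to the same key.
  def combOp(a: Acc, b: Acc): Acc =
    (a._1 ++ b._1, a._2 + b._2, a._3 + b._3)

  def aggregate(partitions: Seq[Seq[(Int, (Int, Float))]]): Map[Int, Acc] = {
    // Step 1: within each partition, apply seqOp per key, starting from zero.
    val perPartition = partitions.map { part =>
      part.foldLeft(Map.empty[Int, Acc]) { case (m, (k, v)) =>
        m.updated(k, seqOp(m.getOrElse(k, zero), v))
      }
    }
    // Step 2: across partitions, merge accumulators per key with combOp.
    perPartition.foldLeft(Map.empty[Int, Acc]) { (merged, m) =>
      m.foldLeft(merged) { case (acc, (k, a)) =>
        acc.updated(k, acc.get(k).map(combOp(_, a)).getOrElse(a))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Rows from the sample data, split by hand into two pretend partitions.
    val partitions = Seq(
      Seq((44, (8602, 37.19f)), (2, (3391, 40.64f))),
      Seq((2, (4424, 55.77f)), (44, (8331, 99.19f)), (5, (3505, 64.18f)))
    )
    aggregate(partitions).toSeq.sortBy(_._1).foreach { case (id, (orders, total, n)) =>
      println(f"$id, (${orders.mkString(",")}), $total%.2f, $n")
    }
  }
}
```

Because the amounts are parsed as Float and summed into a Double, totals can differ from the DataFrame result in the last decimal places.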
