Transform Java-Pair-Rdd to Rdd

Posted 2019-07-13 10:18

I need to write my JavaPairRDD out as a CSV,

so I am thinking of transforming it to a plain RDD to solve my problem.

What I want is to have my RDD transformed from:

Key   Value
Jack  [a,b,c]

to:

Key   Value
Jack  a
Jack  b
Jack  c

I see that this is possible in that issue and in this issue (PySpark: Convert a pair RDD back to a regular RDD), so I am asking how to do it in Java.

Update to the question

The type of my JavaPairRDD is:

JavaPairRDD<Tuple2<String,String>, Iterable<Tuple1<String>>>

and this is the form of a row it contains:

((dr5rvey,dr5ruku),[(2,01/09/2013 00:09,01/09/2013 00:27,N,1,-73.9287262,40.75831223,-73.98726654,40.76442719,2,3.96,16,0.5,0.5,4.25,0,,21.25,1,)])

The key here is (dr5rvey,dr5ruku) and the value is [(2,01/09/2013 00:09,01/09/2013 00:27,N,1,-73.9287262,40.75831223,-73.98726654,40.76442719,2,3.96,16,0.5,0.5,4.25,0,,21.25,1,)]

My original JavaRDD was of type:

JavaRDD<String>

3 answers
我只想做你的唯一
#2 · 2019-07-13 10:49

Assuming the keys should be kept, you can use the flatMapValues function:

Pass each value in the key-value pair RDD through a flatMap function without changing the keys; ...

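JavaPairRDD<Tuple2<String,String>, Iterable<Tuple1<String>>> input = ...;
// Emit one (key, Tuple1) pair per element of each Iterable; the keys are unchanged.
// On Spark 3.x, flatMapValues expects an Iterator, so use iter -> iter.iterator() there.
JavaPairRDD<Tuple2<String, String>, Tuple1<String>> output1 = input.flatMapValues(iter -> iter);
// Unwrap each Tuple1 to get the plain String value.
JavaPairRDD<Tuple2<String, String>, String> output2 = output1.mapValues(t1 -> t1._1());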
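
Since the end goal in the question is a CSV, here is a minimal sketch of turning one flattened record into a CSV line. It is plain Java with no Spark dependency, and the class name CsvLines and the method toCsvLine are hypothetical names made up for illustration. It assumes the fields need no quoting or escaping, which may not hold for your data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CsvLines {
    // Builds one CSV line from the two key fields and one flattened value.
    // Assumption: no field needs quoting or escaping.
    public static String toCsvLine(String key1, String key2, String value) {
        return String.join(",", key1, key2, value);
    }

    public static void main(String[] args) {
        // Stand-in for the values that share one key after flattening.
        List<String> values = Arrays.asList("a", "b", "c");
        List<String> lines = new ArrayList<>();
        for (String v : values) {
            lines.add(toCsvLine("dr5rvey", "dr5ruku", v));
        }
        lines.forEach(System.out::println);
    }
}
```

With Spark, the same helper could presumably be applied as output2.map(t -> CsvLines.toCsvLine(t._1()._1(), t._1()._2(), t._2())) and the resulting JavaRDD<String> written with saveAsTextFile; that wiring is untested here.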
SAY GOODBYE
#3 · 2019-07-13 10:58

If I understand correctly, the type of your RDD is RDD[(String, Array[String])], so you can just apply flatMap to it.

val rdd: RDD[(String, Array[String])] = ???
val newRDD = rdd.flatMap{case (key, array) => array.map(value => (key, value))}

newRDD will be of type RDD[(String, String)]
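
Because the question asks for Java, here is a rough sketch of the same per-record expansion in plain Java collections, without Spark. The class FlattenPairs and the method expand are hypothetical names, and Map.Entry stands in for Spark's Tuple2. It only shows what the flatMap above does to a single (key, values) record.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class FlattenPairs {
    // Expands one (key, values) record into one (key, value) pair per element,
    // keeping the key on every pair.
    public static List<Map.Entry<String, String>> expand(String key, Iterable<String> values) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String value : values) {
            out.add(new SimpleEntry<>(key, value));
        }
        return out;
    }

    public static void main(String[] args) {
        // The example record from the question: Jack -> [a, b, c]
        for (Map.Entry<String, String> e : expand("Jack", Arrays.asList("a", "b", "c"))) {
            System.out.println(e.getKey() + "  " + e.getValue());
        }
    }
}
```

In Spark's Java API, a similar loop could go inside a flatMapToPair lambda that builds Tuple2 objects and returns their iterator; treat that as an untested suggestion.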

你好瞎i
#4 · 2019-07-13 11:06

If I understand correctly, you need the flatMap function, which lets you create multiple rows from a single key. An example in Scala (just the idea; you'll need to adapt it to your use case):

rdd.flatMap { case (key, value) =>
        // Split the comma-separated value and pair each piece with the original key.
        value.split(",").map(piece => (key, piece))
    }

It's a super simplified example, but you should get the gist.

For the RDD:

key      val
mykey   "a,b,c"

the returned RDD will be:

key      val
mykey   "a"
mykey   "b"
mykey   "c"