Perform RDD operations on DataFrames

Published 2019-09-06 03:43

Question:

I have a dataset with 10 fields, loaded as a DataFrame, and I need to perform RDD operations on it. Is it possible to apply RDD operations such as map and flatMap to a DataFrame?

Here is my sample code:

df.select("COUNTY","VEHICLES").show();

This is my DataFrame. I need to convert it to an RDD and run some RDD operations on the result.

Here is how I converted the DataFrame to an RDD:

 RDD<Row> java = df.select("COUNTY","VEHICLES").rdd();

After converting to an RDD, I am not able to see the results. I tried:

java.collect();
java.take(10);
java.foreach();

In all of the above cases I failed to get any results.

Please help me out.

Answer 1:

val myRdd : RDD[String] = ds.rdd

Check the Spark API documentation for Dataset. The conversion to an RDD is declared there as: lazy val rdd: RDD[T]

In your case, build a DataFrame containing only the columns you need by calling select, then call .rdd on it. That will convert it to an RDD.



Answer 2:

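Since Spark 2.0, a DataFrame is simply a Dataset[Row], and you can convert it to a typed Dataset with as[T] (toDS applies to local collections and RDDs, not to DataFrames). Operations such as map and flatMap are then available directly on the Dataset.
I recommend this great article about mastering Spark 2.0.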
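
To illustrate the typed-Dataset route on Spark 2.x, here is a minimal sketch. The case class CountyVehicles, the sample rows, and the assumption that both columns are strings are hypothetical and not taken from the question. It uses as[T] to get a typed Dataset and then calls map and flatMap on it without going through an RDD.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical record type; field names mirror the column names.
case class CountyVehicles(COUNTY: String, VEHICLES: String)

object TypedDatasetSketch {
  // map on a typed Dataset: one output string per input record.
  def labels(spark: SparkSession, df: DataFrame): Array[String] = {
    import spark.implicits._
    df.as[CountyVehicles].map(r => s"${r.COUNTY}: ${r.VEHICLES}").collect()
  }

  // flatMap on a typed Dataset: zero or more output words per record.
  def countyWords(spark: SparkSession, df: DataFrame): Array[String] = {
    import spark.implicits._
    df.as[CountyVehicles].flatMap(r => r.COUNTY.split(" ").toSeq).collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // local mode, for experimentation only
      .appName("typed-dataset-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for the real 10-field dataset.
    val df = Seq(("Kent", "12"), ("East Sussex", "7")).toDF("COUNTY", "VEHICLES")

    labels(spark, df).foreach(println)
    countyWords(spark, df).foreach(println)

    spark.stop()
  }
}
```

If the real VEHICLES column is numeric, the case class field type would need to change accordingly.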



Answer 3:

For Spark 1.6:

You won't see the values you expect because converting a DataFrame to an RDD produces an RDD[Row], not an RDD of plain values.

So when you try any of these:

java.collect();
java.take(10);
java.foreach();

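each one works with Row objects (collect, for example, returns an Array[Row]) rather than with the plain field values, so you do not get the results you are looking for.

Solution:

You can extract the individual values from each Row and build a new RDD from them, like this: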

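// Keep only the two columns of interest.
val newDF = df.select("COUNTY", "VEHICLES")
// Map each Row to a plain (county, vehicles) tuple.
val resultantRDD = newDF.rdd.map { row =>
  val county = row.getAs[String]("COUNTY")
  val vehicles = row.getAs[String]("VEHICLES")
  (county, vehicles)
}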

Now you can apply foreach and collect to this RDD to get the values.

P.S.: The code is written in Scala, but you should get the essence of what I am trying to do.
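
For anyone who wants to try the Row-to-tuple approach end to end, here is a self-contained sketch. It assumes Spark 2.x or later (it uses SparkSession, which Spark 1.6 does not have), made-up sample rows, and string-typed columns. The object name RowsToPairs and the helper toPairs are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object RowsToPairs {
  // Turn the selected rows into plain (county, vehicles) tuples.
  def toPairs(df: DataFrame): RDD[(String, String)] =
    df.select("COUNTY", "VEHICLES").rdd.map { row =>
      (row.getAs[String]("COUNTY"), row.getAs[String]("VEHICLES"))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // local mode, for experimentation only
      .appName("rows-to-pairs")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for the real dataset.
    val df = Seq(("Kent", "12"), ("Essex", "7")).toDF("COUNTY", "VEHICLES")
    val pairs = toPairs(df)

    // collect() and take() return data to the driver; print it explicitly.
    pairs.collect().foreach { case (county, vehicles) => println(s"$county -> $vehicles") }
    pairs.take(1).foreach(println)

    spark.stop()
  }
}
```

Note that collect and take only return data; nothing appears until you print it. On a cluster, a function passed to the RDD's own foreach runs on the executors, so its output goes to the executor logs rather than the driver console.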



Answer 4:

Try persisting the RDD before reading data from it.

val finalRdd = mbnfinal.rdd
finalRdd.cache()
finalRdd.count()
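
A slightly fuller sketch of the same idea, assuming Spark 2.x or later and made-up sample data (the names CacheSketch and cachedRows are hypothetical). cache() only marks the RDD for persistence. The count() that follows is an action, so it is what actually runs the job and fills the cache.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object CacheSketch {
  // Convert to an RDD, mark it for caching, and force evaluation once.
  def cachedRows(df: DataFrame): RDD[Row] = {
    val rdd = df.rdd
    rdd.cache() // lazy: nothing is computed yet
    rdd.count() // action: triggers the computation and populates the cache
    rdd
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // local mode, for experimentation only
      .appName("cache-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for the real dataset.
    val df = Seq(("Kent", "12"), ("Essex", "7")).toDF("COUNTY", "VEHICLES")

    val rows = cachedRows(df)
    // Later actions reuse the cached partitions.
    rows.take(10).foreach(println)

    spark.stop()
  }
}
```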