I have a dataset with 10 fields, and I need to perform RDD operations on this DataFrame. Is it possible to perform RDD operations such as map, flatMap, etc.?
Here is my sample code:
df.select("COUNTY","VEHICLES").show();
This is my DataFrame, and I need to convert it to an RDD and apply some RDD operations on that new RDD. Here is the code with which I converted the DataFrame to an RDD:
RDD<Row> java = df.select("COUNTY","VEHICLES").rdd();
After converting to an RDD, I am not able to see the RDD results. I tried:
java.collect();
java.take(10);
java.foreach();
In all of the above cases I failed to get results. Please help me out.
val myRdd: RDD[String] = ds.rdd
Check the Spark API documentation for converting a Dataset to an RDD:
lazy val rdd: RDD[T]
In your case, build the DataFrame with the selected records by performing the select, and after that call .rdd; that will convert it to an RDD.
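A minimal, self-contained sketch of that flow in local mode (the sample rows here are invented for illustration; only the column names come from the question):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.rdd.RDD

// Local-mode session, just for illustration.
val spark = SparkSession.builder().master("local[*]").appName("df-to-rdd").getOrCreate()
import spark.implicits._

// Hypothetical sample data using the question's column names.
val df = Seq(("KINGS", "120"), ("QUEENS", "95")).toDF("COUNTY", "VEHICLES")

// select first, then call .rdd — this yields an RDD[Row].
val rowRdd: RDD[Row] = df.select("COUNTY", "VEHICLES").rdd

// Ordinary RDD operations now work on it.
rowRdd.take(2).foreach(println)
```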
Since Spark 2.0 a DataFrame is simply a Dataset[Row], so you can convert it to a typed Dataset with the as[T] method (toDS is for local collections and RDDs) and then use map, flatMap, etc. directly on it.
Recommend this great article about mastering Spark 2.0.
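A sketch of the typed route, assuming a case class whose field names match the question's two columns (the class name and sample values are invented here):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class matching the question's columns.
case class CountyVehicles(COUNTY: String, VEHICLES: String)

val spark = SparkSession.builder().master("local[*]").appName("df-to-ds").getOrCreate()
import spark.implicits._

val df = Seq(("KINGS", "120"), ("QUEENS", "95")).toDF("COUNTY", "VEHICLES")

// In Spark 2.x a DataFrame is a Dataset[Row]; as[T] gives a typed Dataset,
// on which map/flatMap work with plain objects instead of Row.
val ds = df.as[CountyVehicles]
val counties = ds.map(_.COUNTY).collect()
```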
For Spark 1.6:
You won't be able to see the results, because converting a DataFrame to an RDD turns it into an RDD[Row]. Hence when you try any of these:
java.collect();
java.take(10);
java.foreach();
collect and take hand you back an Array[Row], so you are not able to get at the plain values.
Solution:
You can convert each Row to its respective values and get an RDD of those out of it, like here:
val newDF = df.select("COUNTY", "VEHICLES")
val resultantRDD = newDF.rdd.map { row =>
  val county = row.getAs[String]("COUNTY")
  val vehicles = row.getAs[String]("VEHICLES")
  (county, vehicles)
}
And now you can apply the foreach and collect functions to get the values.
P.S.: The code is written in Scala, but you can get the essence of what I am trying to do!
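Put together as a self-contained sketch (local mode, with the same invented sample rows as above), showing collect bringing the tuples back to the driver:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD

val spark = SparkSession.builder().master("local[*]").appName("row-to-values").getOrCreate()
import spark.implicits._

// Hypothetical sample data with the question's column names.
val df = Seq(("KINGS", "120"), ("QUEENS", "95")).toDF("COUNTY", "VEHICLES")

val newDF = df.select("COUNTY", "VEHICLES")
val resultantRDD: RDD[(String, String)] = newDF.rdd.map { row =>
  (row.getAs[String]("COUNTY"), row.getAs[String]("VEHICLES"))
}

// collect() returns the tuples to the driver; foreach prints each pair.
resultantRDD.collect().foreach { case (county, vehicles) =>
  println(s"$county -> $vehicles")
}
```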
Try persisting the RDD before reading the data from it:
val finalRdd = mbnfinal.rdd
finalRdd.cache()
finalRdd.count()
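As a runnable sketch of that caching pattern (mbnfinal stood in by invented sample data): cache() only marks the RDD for persistence, while count() is an action that forces evaluation, so later reads hit the materialized data.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("cache-rdd").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for mbnfinal from the answer.
val mbnfinal = Seq(("KINGS", "120"), ("QUEENS", "95")).toDF("COUNTY", "VEHICLES")

val finalRdd = mbnfinal.rdd
finalRdd.cache()                    // mark for in-memory persistence
finalRdd.count()                    // action that materializes the cache
finalRdd.take(10).foreach(println)  // subsequent reads use the cached data
```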