How can I convert an RDD (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]) to a DataFrame (org.apache.spark.sql.DataFrame)? I converted a DataFrame to an RDD using .rdd. After processing it, I want it back as a DataFrame. How can I do this?
This code works perfectly from Spark 2.x with Scala 2.11.

Import the necessary classes and create a SparkSession object; here it's called spark. Then take an RDD and turn it into a DataFrame.

Method 1: Use SparkSession.createDataFrame(RDD obj).

Method 2: Use SparkSession.createDataFrame(RDD obj) and specify column names.

Method 3 (actual answer to the question): This way requires the input rdd to be of type RDD[Row]. Create the schema, then apply both rowsRdd and schema to createDataFrame().
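The original code blocks did not survive; here is a minimal sketch of the three methods. The sample data, column names, and the rowsRdd/schema identifiers are illustrative assumptions:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder.appName("RddToDf").master("local[*]").getOrCreate()
import spark.implicits._

// Sample RDD of tuples (assumed data for illustration)
val rdd = spark.sparkContext.parallelize(Seq((1, "alice"), (2, "bob")))

// Method 1: createDataFrame on an RDD of tuples (default column names _1, _2)
val df1 = spark.createDataFrame(rdd)

// Method 2: same call, then naming the columns
val df2 = spark.createDataFrame(rdd).toDF("id", "name")

// Method 3: for an RDD[Row], build a schema and pass both to createDataFrame
val rowsRdd = rdd.map { case (id, name) => Row(id, name) }
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))
val df3 = spark.createDataFrame(rowsRdd, schema)
```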
Here is a simple example of converting a List into a Spark RDD and then converting that Spark RDD into a DataFrame.

Please note that I have used Spark-shell's Scala REPL to execute the following code. Here sc is an instance of SparkContext, which is implicitly available in Spark-shell. I hope it answers your question.
Method 1: (Scala)
Method 2: (Scala)
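The Scala snippets were lost in extraction; a plausible sketch of both methods, with the list contents and column names assumed:

```scala
// In spark-shell, sc (SparkContext) and spark (SparkSession) are predefined,
// and spark.implicits._ is already imported.
val list = List(("mark", 25), ("john", 30))
val rdd = sc.parallelize(list)

// Method 1: toDF with column names
val df1 = rdd.toDF("name", "age")

// Method 2: createDataFrame, then name the columns
val df2 = spark.createDataFrame(rdd).toDF("name", "age")
```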
Method 1: (Python)
Method 2: (Python)
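The Python snippets were likewise lost; a sketch of the equivalent PySpark calls, with the same assumed data (in the pyspark shell, sc and spark are predefined):

```python
# Method 1: toDF with a list of column names
rdd = sc.parallelize([("mark", 25), ("john", 30)])
df1 = rdd.toDF(["name", "age"])

# Method 2: createDataFrame, passing the column names directly
df2 = spark.createDataFrame(rdd, ["name", "age"])
```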
Extract the values from the Row object and then apply a case class to convert the RDD to a DataFrame.
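A sketch of this approach; the Person case class, the column positions/types, and the rowRdd name are assumptions for illustration:

```scala
import spark.implicits._

// Assumed shape: each Row holds (name: String, age: Int)
case class Person(name: String, age: Int)

val df = rowRdd
  .map(row => Person(row.getString(0), row.getInt(1))) // extract values from the Row
  .toDF()                                              // case class fields become columns
```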
Suppose you have a DataFrame and you want to do some modification on the fields' data by converting it to RDD[Row].

To convert back from the RDD to a DataFrame, we need to define the structure type of the RDD. If the datatype was Long, it will become LongType in the structure; if String, then StringType.

Now you can convert the RDD to a DataFrame using the createDataFrame method.
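A minimal sketch of the round trip, assuming df is the original DataFrame and each row holds (id: Long, name: String); the modification shown is just an example:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Modify the field data while in RDD[Row] form
val modifiedRdd: RDD[Row] = df.rdd.map { row =>
  Row(row.getLong(0), row.getString(1).toUpperCase) // example modification
}

// Long -> LongType, String -> StringType in the structure
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)
))

val newDf = spark.createDataFrame(modifiedRdd, schema)
```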
Assuming your RDD[Row] is called rdd, you can use:
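The snippet that followed was lost; the likely call, assuming a matching StructType named schema is in scope, is:

```scala
// rdd: RDD[Row], schema: StructType describing its columns
val df = spark.createDataFrame(rdd, schema)
```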
Note: This answer was originally posted here
I am posting this answer because I would like to share additional details about the available options that I did not find in the other answers.
To create a DataFrame from an RDD of Rows, there are two main options:
1) As already pointed out, you could use toDF(), which can be imported by import sqlContext.implicits._. However, this approach only works for the following types of RDDs:

- RDD[Int]
- RDD[Long]
- RDD[String]
- RDD[T <: scala.Product]

(source: Scaladoc of the SQLContext.implicits object)

The last signature actually means that it can work for an RDD of tuples or an RDD of case classes (because tuples and case classes are subclasses of scala.Product).

So, to use this approach for an RDD[Row], you have to map it to an RDD[T <: scala.Product]. This can be done by mapping each row to a custom case class or to a tuple, as in the following code snippets:
The main drawback of this approach (in my opinion) is that you have to explicitly set the schema of the resulting DataFrame in the map function, column by column. Maybe this can be done programmatically if you don't know the schema in advance, but things can get a little messy there. So, alternatively, there is another option:
2) You can use createDataFrame(rowRDD: RDD[Row], schema: StructType), as in the accepted answer, which is available on the SQLContext object. Example for converting an RDD of an old DataFrame:

Note that there is no need to explicitly set any schema column. We reuse the old DF's schema, which is of the StructType class and can be easily extended. However, this approach sometimes is not possible, and in some cases can be less efficient than the first one.
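The example referenced above was lost; a sketch of the approach, assuming oldDf is the original DataFrame being converted and transformed:

```scala
// Reuse the old DataFrame's schema; no column-by-column definition needed.
val rdd = oldDf.rdd.map { row =>
  // ... some per-row transformation that returns a Row with the same shape ...
  row
}
val newDf = sqlContext.createDataFrame(rdd, oldDf.schema)
```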