The RDD has been created in the format Array[Array[String]]
and has the following values:
val rdd : Array[Array[String]] = Array(
Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"),
Array("4580056797", "0", "2015-07-29 10:38:43", "0", "1", "1"))
I want to create a DataFrame with the schema:
val schemaString = "callId oCallId callTime duration calltype swId"
Next steps:
scala> val rowRDD = rdd.map(p => Array(p(0), p(1), p(2),p(3),p(4),p(5).trim))
rowRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[14] at map at <console>:39
scala> val calDF = sqlContext.createDataFrame(rowRDD, schema)
This gives the following error:
<console>:45: error: overloaded method value createDataFrame with alternatives:
(rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.rdd.RDD[Array[String]],
org.apache.spark.sql.types.StructType)
val calDF = sqlContext.createDataFrame(rowRDD, schema)
Just paste into a spark-shell:
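For example (a sketch; the case class name Call and the all-String field types are assumptions):

case class Call(callId: String, oCallId: String, callTime: String,
  duration: String, calltype: String, swId: String)

val rdd = sc.parallelize(Array(
  Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"),
  Array("4580056797", "0", "2015-07-29 10:38:43", "0", "1", "1")))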
Then map() over the RDD to create instances of the case class, and then create the DataFrame using toDF():
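Continuing the sketch above:

val calDF = rdd.map {
  case Array(callId, oCallId, callTime, duration, calltype, swId) =>
    Call(callId, oCallId, callTime, duration, calltype, swId)
}.toDF()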
This infers the schema from the case class.
Then you can proceed with:
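For example, to inspect the result (these particular calls are just illustrations):

calDF.printSchema()
calDF.show()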
If you want to use toDF() in a normal program (not in the spark-shell), make sure (quoted from here) to import sqlContext.implicits._ right after creating the SQLContext, and to define the case class outside of the method that uses toDF().
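In a standalone program that looks roughly like this (a sketch; the object name, app name, and field types are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Case class defined at top level, outside the method that uses toDF()
case class Call(callId: String, oCallId: String, callTime: String,
  duration: String, calltype: String, swId: String)

object CallApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CallApp"))
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._ // right after creating the SQLContext

    val calDF = sc.parallelize(Seq(
        Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1")))
      .map { case Array(a, b, c, d, e, f) => Call(a, b, c, d, e, f) }
      .toDF()
    calDF.show()
  }
}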
You first need to convert your Array into Row and then define the schema. I made the assumption that most of your fields are Long:
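A sketch of that conversion (exactly which fields are Long is a guess; callTime is kept as a String here):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rows = sc.parallelize(rdd).map(a =>
  Row(a(0).toLong, a(1).toLong, a(2), a(3).toLong, a(4).toLong, a(5).toLong))

val schema = StructType(Seq(
  StructField("callId", LongType, true),
  StructField("oCallId", LongType, true),
  StructField("callTime", StringType, true),
  StructField("duration", LongType, true),
  StructField("calltype", LongType, true),
  StructField("swId", LongType, true)))

val calDF = sqlContext.createDataFrame(rows, schema)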
Using Spark 1.6.1 and Scala 2.10 I got the same error:
error: overloaded method value createDataFrame with alternatives:
For me, the gotcha was the signature of createDataFrame: I was trying to use a val rdd : List[Row], but it failed because java.util.List[org.apache.spark.sql.Row] and scala.collection.immutable.List[org.apache.spark.sql.Row] are NOT the same. The working solution I found was to convert val rdd : Array[Array[String]] into RDD[Row] via List[Array[String]]. I find this is the closest to what's in the documentation. I assume that your schema is, like in the Spark Guide, as follows:
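A sketch of that approach, building the schema from schemaString as in the Spark SQL programming guide:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val schema = StructType(schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, true)))

// Array[Array[String]] -> List[Array[String]] -> RDD[Row]
val rowRDD = sc.parallelize(rdd.toList).map(arr => Row.fromSeq(arr))
val calDF = sqlContext.createDataFrame(rowRDD, schema)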
If you look at the signature of createDataFrame, here is the one that accepts a StructType as the 2nd argument (for Scala):
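def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame

This is the same alternative that appears in the error message above.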
So it accepts as its 1st argument an RDD[Row]. What you have in rowRDD is an RDD[Array[String]], so there is a mismatch. Do you need an RDD[Array[String]]? Otherwise you can use the following to create your DataFrame:
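For example (a sketch that wraps each array in a Row and reuses the schema built from schemaString):

import org.apache.spark.sql.Row

val rowRDD = sc.parallelize(rdd).map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5).trim))
val calDF = sqlContext.createDataFrame(rowRDD, schema)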