How to create DataFrame from Scala's List of Iterables

Posted 2019-01-13 06:37

Question:

I have the following Scala value:

val values: List[Iterable[Any]] = Traces().evaluate(features).toList

and I want to convert it to a DataFrame.

When I try the following:

sqlContext.createDataFrame(values)

I get this error:

error: overloaded method value createDataFrame with alternatives:

[A <: Product](data: Seq[A])(implicit evidence$2: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame 
[A <: Product](rdd: org.apache.spark.rdd.RDD[A])(implicit evidence$1: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
cannot be applied to (List[Iterable[Any]])
          sqlContext.createDataFrame(values)

Why?

Answer 1:

That's what the Spark implicits object is for. It lets you convert common Scala collection types into a DataFrame / Dataset / RDD. Here is an example with Spark 2.0, but the mechanism exists in older versions too:

import org.apache.spark.sql.SparkSession
val values = List(1,2,3,4,5)

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = values.toDF()

Edit: I just realised you were after a 2D list. Here is something I tried in spark-shell: I converted the 2D list to a list of tuples and used the implicit conversion to DataFrame:

val values = List(List("1", "One"), List("2", "Two"), List("3", "Three"), List("4", "4")).map(x => (x(0), x(1)))
import spark.implicits._
val df = values.toDF

Edit 2: The original question by MTT was how to create a Spark DataFrame from a Scala list for a 2D list, for which this is a correct answer. The original question is https://stackoverflow.com/revisions/38063195/1 ; the question was later changed to match an accepted answer. I'm adding this edit so that anyone looking for something similar to the original question can find it.
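Putting the edit's steps together, here is a minimal self-contained sketch of the tuple approach, assuming a local SparkSession; the column names "id" and "name" are illustrative, not from the original answer:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: running locally, e.g. in spark-shell or a standalone app.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// 2D list converted to a list of tuples, so the implicit toDF applies.
val values = List(List("1", "One"), List("2", "Two"), List("3", "Three"))
val df = values.map(x => (x(0), x(1))).toDF("id", "name")
df.show()
```

The key point is that `toDF` needs a `Product` element type (tuples or case classes), which is why the inner `List`s are first mapped to tuples.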



Answer 2:

As zero323 mentioned, we first need to convert the List[Iterable[Any]] to a List[Row], then put the rows in an RDD, and prepare a schema for the Spark DataFrame.

To convert List[Iterable[Any]] to List[Row], we can write (note the `.toSeq`, which is needed because varargs expansion with `: _*` requires a Seq, not an Iterable):

val rows = values.map{x => Row(x.toSeq: _*)}

and then, given a schema (here a value called schema), we can make the RDD:

val rdd = sparkContext.makeRDD(rows)

and finally create a spark data frame

val df = sqlContext.createDataFrame(rdd, schema)
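Assembled into one runnable sketch (the two string columns and their names are assumptions for illustration; adjust the StructType to match your actual data):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Stand-in for Traces().evaluate(features).toList from the question.
val values: List[Iterable[Any]] = List(Seq("1", "One"), Seq("2", "Two"))

// Step 1: List[Iterable[Any]] -> List[Row]
val rows = values.map(x => Row(x.toSeq: _*))

// Step 2: an explicit schema matching the row shape.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("name", StringType, nullable = true)
))

// Step 3: rows into an RDD, then into a DataFrame.
val rdd = spark.sparkContext.makeRDD(rows)
val df = spark.createDataFrame(rdd, schema)
df.show()
```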


Answer 3:

Simplest approach:

val newList = yourList.map(Tuple1(_))
val df = spark.createDataFrame(newList).toDF("stuff")


Answer 4:

In Spark 2 we can use a Dataset by simply converting the list to a DS with the toDS API:

val ds = list.flatMap(_.split(",")).toDS() // Records split by comma 

or

val ds = list.toDS()

This is more convenient than going through an RDD or DataFrame.
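A fuller sketch of the Dataset route, assuming a local SparkSession and a list of comma-separated strings (the sample data is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val list = List("a,b", "c,d")

// Records split by comma, giving a Dataset[String] with one row per token.
val ds = list.flatMap(_.split(",")).toDS()
ds.show()
```

As with toDF, the toDS conversion comes from `spark.implicits._`, which supplies the Encoder for primitive and Product types.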



Answer 5:

The most concise way I've found:

val df = spark.createDataFrame(List("A", "B", "C").map(Tuple1(_)))