How to parse a CSV string into a Spark dataframe

Posted 2020-04-23 05:31

Question:

I would like to convert an RDD containing records of strings, like below, to a Spark dataframe.

"Mike,2222-003330,NY,34"
"Kate,3333-544444,LA,32"
"Abby,4444-234324,MA,56"
....

The schema line is not inside the same RDD, but in another variable:

val header = "name,account,state,age"

So my question is: how do I use these two together to create a dataframe in Spark? I am using Spark version 2.2.

I did search and found a post: Can I read a CSV represented as a string into Apache Spark using spark-csv. However, it's not exactly what I need, and I can't figure out how to modify that code to work in my case.

Your help is greatly appreciated.

Answer 1:

The easiest way would probably be to start from the CSV file and read it directly as a dataframe (by specifying the schema). You can see an example here: Provide schema while reading csv file as a dataframe.
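For reference, reading the file directly with an explicit schema might look like the sketch below. The file path `people.csv` is hypothetical; the schema matches the four columns from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.getOrCreate()

// Explicit schema: three string columns plus an integer age column
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("account", StringType, nullable = true),
  StructField("state", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

// Read the CSV file with the schema applied, skipping type inference
val df = spark.read.schema(schema).csv("people.csv")
```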


When the data already exists in an RDD you can use toDF() to convert to a dataframe. This function also accepts column names as input. To use this functionality, first import the spark implicits using the SparkSession object:

val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._

Since the RDD contains strings, it needs to first be converted to tuples representing the columns in the dataframe. In this case, this will be an RDD[(String, String, String, Int)] since there are four columns (the last column, age, is converted to Int to illustrate how it can be done).

Assuming the input data are in rdd:

val header = "name,account,state,age"

val df = rdd.map(row => row.split(","))
  .map{ case Array(name, account, state, age) => (name, account, state, age.toInt)}
  .toDF(header.split(","):_*)

Resulting dataframe:

+----+-----------+-----+---+
|name|    account|state|age|
+----+-----------+-----+---+
|Mike|2222-003330|   NY| 34|
|Kate|3333-544444|   LA| 32|
|Abby|4444-234324|   MA| 56|
+----+-----------+-----+---+
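As an alternative, since Spark 2.2 the CSV reader can also consume a Dataset[String] of CSV lines directly, so the header line and the RDD can be combined and handed to the regular CSV parser. A sketch, assuming the same `header` variable and an RDD of strings named `rdd`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val header = "name,account,state,age"

// Prepend the header line to the data lines, so the CSV reader
// can pick up the column names from it
val lines = spark.sparkContext.parallelize(Seq(header)) ++ rdd
val ds = lines.toDS()

// Let the CSV reader parse the lines and infer column types
// (age comes out as an integer)
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(ds)
```

This avoids the manual split/tuple step, at the cost of an extra pass over the data for schema inference.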