I have a text file on HDFS and I want to convert it to a Data Frame in Spark.
I am using the Spark Context to load the file and then try to generate individual columns from that file.
val myFile = sc.textFile("file.txt")
val myFile1 = myFile.map(x=>x.split(";"))
After doing this, I am trying the following operation.
myFile1.toDF()
I am getting an issue, since the elements in the myFile1 RDD are now of array type.
How can I solve this issue?
You can read the file into an RDD and then assign a schema to it. The two common ways of creating a schema are using a case class or a Schema object [my preferred one]. Below are quick snippets of code that you may use.
Case Class approach
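A minimal sketch of the case-class approach, assuming a Spark 1.x shell where sc and sqlContext already exist; the case class name and fields are illustrative and should match your file's columns:

```scala
// Illustrative case class; adapt the fields to your actual columns.
case class Employee(id: Int, name: String)

import sqlContext.implicits._  // brings toDF() into scope

val df = sc.textFile("file.txt")
  .map(_.split(";"))
  .map(a => Employee(a(0).trim.toInt, a(1)))
  .toDF()
```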
Schema Approach
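A sketch of the Schema (StructType) approach, again assuming sc and sqlContext from a Spark 1.x shell; the column names here are illustrative:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Define the schema explicitly; no case class needed, so no 22-field limit.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("name", StringType, nullable = true)))

// Convert each split line into a Row matching the schema.
val rowRdd = sc.textFile("file.txt")
  .map(_.split(";"))
  .map(a => Row(a(0), a(1)))

val df = sqlContext.createDataFrame(rowRdd, schema)
```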
The second one is my preferred approach, since a case class has a limitation of at most 22 fields, and this will be a problem if your file has more than 22 fields!
Update - as of Spark 1.6, you can simply use the built-in csv data source:
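A minimal sketch of the built-in csv data source, assuming a SparkSession named spark is already available (as in spark-shell); the file name follows the question:

```scala
// spark: SparkSession, e.g. provided by spark-shell
val df = spark.read.csv("file.txt")
```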
You can also use various options to control the CSV parsing, e.g.:
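For example, a sketch using options that match the semicolon-separated file from the question (assuming the same spark session):

```scala
val df = spark.read
  .option("header", "true")       // treat the first line as column names
  .option("delimiter", ";")       // the question's file is semicolon-separated
  .option("inferSchema", "true")  // infer column types, at the cost of an extra scan
  .csv("file.txt")
```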
For Spark versions < 1.6: The easiest way is to use spark-csv; include it in your dependencies and follow the README. It allows setting a custom delimiter (;), can read CSV headers (if you have them), and it can infer the schema types (at the cost of an extra scan of the data). Alternatively, if you know the schema, you can create a case class that represents it and map your RDD elements into instances of this class before transforming into a DataFrame, e.g.:
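A sketch of that alternative, assuming a Spark 1.x shell (sc and sqlContext available) and two illustrative fields; adapt the case class to your file:

```scala
// Illustrative schema for a two-column, semicolon-separated file.
case class Record(id: Int, name: String)

import sqlContext.implicits._  // brings toDF() into scope

val df = sc.textFile("file.txt")
  .map(_.split(";"))
  .map(a => Record(a(0).trim.toInt, a(1)))
  .toDF()
```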
You will not be able to convert it into a DataFrame until you use an implicit conversion.
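A sketch of enabling the implicit conversion in Spark 1.x, assuming an existing SparkContext named sc:

```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._  // enables rdd.toDF()
```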
Only after this can you convert the RDD to a DataFrame.
I know I am quite late to answer this but I have come up with a different answer:
Below I have given different ways to create a DataFrame from a text file:
raw text file
Spark session without schema
Spark session with schema
using SQL context
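Sketches of the approaches listed above, assuming a Spark 2.x shell where both sc and a SparkSession named spark exist; file names and column names are illustrative:

```scala
// 1. Raw text file: just an RDD of lines.
val rdd = sc.textFile("file.txt")

// 2. Spark session without schema: one string column named "value".
val dfText = spark.read.text("file.txt")

// 3. Spark session with an explicit schema.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
val schema = StructType(Seq(
  StructField("col1", StringType, nullable = true),
  StructField("col2", StringType, nullable = true)))
val rowRdd = sc.textFile("file.txt").map(_.split(";")).map(a => Row(a(0), a(1)))
val dfSchema = spark.createDataFrame(rowRdd, schema)

// 4. Using SQLContext (Spark 1.x style).
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val dfSql = sc.textFile("file.txt")
  .map(_.split(";"))
  .map(a => (a(0), a(1)))
  .toDF("col1", "col2")
```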
If you want to use the toDF method, you have to convert your RDD of Array[String] into an RDD of a case class. For example, you have to do:
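A sketch of that conversion, assuming a Spark 1.x shell with sc and sqlContext available; the case class and its fields are illustrative:

```scala
// Map each Array[String] into an instance of a case class, then call toDF.
case class Person(name: String, age: Int)

import sqlContext.implicits._

val df = sc.textFile("file.txt")
  .map(_.split(";"))
  .map(a => Person(a(0), a(1).trim.toInt))
  .toDF()
```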