Suppose I'm doing something like:
```scala
val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.printSchema()
```

```
root
 |-- year: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- comment: string (nullable = true)
 |-- blank: string (nullable = true)
```

```scala
df.show()
```

```
year make  model comment              blank
2012 Tesla S     No comment
1997 Ford  E350  Go get one now th...
```
but I really wanted the year as an Int (and perhaps transform some other columns).
The best I could come up with is:

```scala
df.withColumn("year2", 'year.cast("Int")).select('year2 as 'year, 'make, 'model, 'comment, 'blank)
```

```
org.apache.spark.sql.DataFrame = [year: int, make: string, model: string, comment: string, blank: string]
```
which is a bit convoluted.
I'm coming from R, and I'm used to being able to write, e.g.:

```r
df2 <- df %>%
  mutate(year = year %>% as.integer,
         make = make %>% toupper)
```
I'm likely missing something, since there should be a better way to do this in Spark/Scala...
Since Spark version 1.4 you can apply the cast method with a DataType on the column:
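A minimal sketch of that approach, assuming the question's `df`; `Column.cast` accepts a `DataType` such as `IntegerType`:

```scala
import org.apache.spark.sql.types.IntegerType

// Replace the string "year" column with an Int-typed version of itself.
val df2 = df.withColumn("year", df("year").cast(IntegerType))
```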
If you are using SQL expressions you can also do:
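A minimal sketch, assuming the `expr` helper from `org.apache.spark.sql.functions`:

```scala
import org.apache.spark.sql.functions.expr

// Same cast, written as a SQL expression instead of a Column method call.
val df2 = df.withColumn("year", expr("CAST(year AS INT)"))
```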
For more info check the docs: http://spark.apache.org/docs/1.6.0/api/scala/#org.apache.spark.sql.DataFrame
Java code for modifying the datatype of the DataFrame column from String to Integer. It simply casts the existing String column to Integer.
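To keep one language in this post, here is a Scala sketch of that cast; the Java DataFrame API exposes the equivalent `col`, `cast`, and `withColumn` methods:

```scala
// Overwrite the string "year" column with an int version.
// In Java this would be: df.withColumn("year", df.col("year").cast("int"));
val casted = df.withColumn("year", df("year").cast("int"))
```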
To convert the year from string to int, you can add the following option to the CSV reader: "inferSchema" -> "true"; see the Databricks spark-csv documentation.
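Applied to the load call from the question:

```scala
// Same reader as above, with schema inference enabled so "year" comes back as int.
val df = sqlContext.load("com.databricks.spark.csv",
  Map("path" -> "cars.csv", "header" -> "true", "inferSchema" -> "true"))
```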
Another way:
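One plausible reading of "another way" (an assumption, not confirmed by this thread) is converting through a UDF:

```scala
import org.apache.spark.sql.functions.udf

// Hypothetical sketch: parse the string column with a UDF and overwrite it.
val toInt = udf((s: String) => s.toInt)
val df2 = df.withColumn("year", toInt(df("year")))
```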