Suppose I'm doing something like:
val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.printSchema()
root
 |-- year: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- comment: string (nullable = true)
 |-- blank: string (nullable = true)
df.show()
year  make   model  comment               blank
2012  Tesla  S      No comment
1997  Ford   E350   Go get one now th...
but I really wanted the year as Int (and perhaps transform some other columns).
The best I could come up with is
df.withColumn("year2", 'year.cast("Int")).select('year2 as 'year, 'make, 'model, 'comment, 'blank)
org.apache.spark.sql.DataFrame = [year: int, make: string, model: string, comment: string, blank: string]
which is a bit convoluted.
I'm coming from R, and I'm used to being able to write, e.g.
df2 <- df %>%
    mutate(year = year %>% as.integer,
           make = make %>% toupper)
I'm likely missing something, since there should be a better way to do this in spark/scala...
You can change the data type of a column by using cast in Spark SQL. Suppose the table name is table, it has only two columns, column1 and column2, and the data type of column1 is to be changed. For example:

spark.sql("select cast(column1 as Double) column1NewName, column2 from table")

In the place of Double, write your data type.
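A minimal runnable sketch of the same idea, assuming a Spark 2.x SparkSession named spark and a DataFrame df with columns column1 and column2 (the view name table is just the one used above):

// Register the DataFrame so it can be queried with SQL
df.createOrReplaceTempView("table")

// Cast column1 to Double and rename it in the same expression
val converted = spark.sql(
  "select cast(column1 as Double) column1NewName, column2 from table")
converted.printSchema()  // column1NewName: double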
To the answers suggesting cast: FYI, the cast method was broken in Spark 1.4.1.
For example, a DataFrame with a string column having the value "8182175552014127960", when cast to bigint, has the value "8182175552014128100".
We had to face a lot of issues before finding this bug because we had bigint columns in production.
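This looks like the usual precision loss when a string is parsed through Double before being truncated to a 64-bit integer; that cause is my assumption, but the effect is easy to demonstrate in plain Scala without Spark:

val s = "8182175552014127960"
val direct    = s.toLong          // exact: the value fits in a 64-bit Long
val viaDouble = s.toDouble.toLong // lossy: Double has only a 53-bit mantissa
println(direct == viaDouble)      // false: the low digits were rounded away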
You can use selectExpr to make it a little cleaner:
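A minimal sketch, assuming the cars DataFrame from the question:

// Cast year to int inside the projection; the other columns pass through unchanged
df.selectExpr("cast(year as int) as year", "make", "model", "comment", "blank")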
First, if you want to cast a type, you can do it with withColumn: with the same column name, the column will be replaced with the new one, so you don't need separate add and delete steps.
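A sketch of that, assuming the question's DataFrame:

import org.apache.spark.sql.types.IntegerType

// Reusing the name "year" replaces the string column with the int one in place
df.withColumn("year", df("year").cast(IntegerType))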
Second, about Scala vs R: the Scala code most similar to the R version that I can achieve is sketched below.
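A sketch, assuming the cars DataFrame from the question (the pattern-matching style and the upper function standing in for R's toupper are my choices):

import org.apache.spark.sql.functions.upper
import org.apache.spark.sql.types.IntegerType

// Rebuild every column in one select: cast year, uppercase make, pass the rest through
val df2 = df.select(df.columns.map {
  case y @ "year" => df(y).cast(IntegerType).as(y)
  case m @ "make" => upper(df(m)).as(m)
  case other      => df(other)
}: _*)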
Though the length is a little longer than R's. Note that mutate is a function of R's data frame, so Scala holds up well in expressive power without needing a special function. (df.columns is, surprisingly, an Array[String] instead of an Array[Column]; maybe they wanted it to look like Python pandas's DataFrame.)