Suppose I'm doing something like:
val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.printSchema()
root
|-- year: string (nullable = true)
|-- make: string (nullable = true)
|-- model: string (nullable = true)
|-- comment: string (nullable = true)
|-- blank: string (nullable = true)
df.show()
year  make   model  comment               blank
2012  Tesla  S      No comment
1997  Ford   E350   Go get one now th...
but I really wanted the year as Int (and perhaps transform some other columns).
The best I could come up with is
df.withColumn("year2", 'year.cast("Int")).select('year2 as 'year, 'make, 'model, 'comment, 'blank)
org.apache.spark.sql.DataFrame = [year: int, make: string, model: string, comment: string, blank: string]
which is a bit convoluted.
I'm coming from R, and I'm used to being able to write, e.g.
df2 <- df %>%
  mutate(year = year %>% as.integer,
         make = make %>% toupper)
I'm likely missing something, since there should be a better way to do this in Spark/Scala...
[EDIT: March 2016: thanks for the votes! Though really, this is not the best answer; I think the solutions based on withColumn, withColumnRenamed and cast put forward by msemelman, Martin Senne and others are simpler and cleaner.]

I think your approach is ok. Recall that a Spark DataFrame is an (immutable) RDD of Rows, so we're never really replacing a column, just creating a new DataFrame each time with a new schema.

Assuming you have an original df with the following schema:
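For instance, the schema the cars.csv example above prints:

root
|-- year: string (nullable = true)
|-- make: string (nullable = true)
|-- model: string (nullable = true)
|-- comment: string (nullable = true)
|-- blank: string (nullable = true)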
And some UDFs defined on one or several columns:
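For instance, a sketch (toInt and toUpper are just illustrative names):

import org.apache.spark.sql.functions.udf

// Parse a string into an Int, and upper-case a string
// (mirroring the R mutate example in the question).
val toInt   = udf[Int, String](_.toInt)
val toUpper = udf[String, String](_.toUpperCase)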
Changing column types or even building a new DataFrame from another can be written like this:
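For example, a sketch using the illustrative UDFs above:

// Build a new DataFrame where "year" and "make" are replaced by
// converted versions of themselves.
val df2 = df
  .withColumn("year", toInt(df("year")))
  .withColumn("make", toUpper(df("make")))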
which yields:
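(roughly, from df2.printSchema):

root
|-- year: integer (nullable = true)
|-- make: string (nullable = true)
|-- model: string (nullable = true)
|-- comment: string (nullable = true)
|-- blank: string (nullable = true)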
This is pretty close to your own solution. Simply keeping the type changes and other transformations as separate udf vals makes the code more readable and re-usable.

As the cast operation is available for Spark Columns (and as I personally do not favour udfs as proposed by @Svend at this point), how about using cast directly, as sketched below, to cast to the requested type? As a neat side effect, values that are not castable / "convertable" in that sense will become null.
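A minimal sketch (listing the remaining columns explicitly to keep the original order):

import org.apache.spark.sql.types.IntegerType

// Cast "year" to integer and keep the other columns unchanged.
val df2 = df.select(
  df("year").cast(IntegerType).as("year"),
  df("make"), df("model"), df("comment"), df("blank")
)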
In case you need this as a helper method, use:
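A sketch of such a helper (DFHelper and castColumnTo are just illustrative names):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.DataType

object DFHelper {
  // Returns a new DataFrame with column `cn` cast to the datatype `tpe`.
  def castColumnTo(df: DataFrame, cn: String, tpe: DataType): DataFrame =
    df.withColumn(cn, df(cn).cast(tpe))
}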
which is used like:
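For example, with the sketch above:

import org.apache.spark.sql.types.IntegerType

val df2 = DFHelper.castColumnTo(df, "year", IntegerType)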
This method will drop the old column and create new columns with the same values and a new datatype. The original datatypes when my DataFrame was created, the code I then ran to change the datatype, and the resulting schema followed the pattern sketched below.
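A sketch of that pattern (the column names and types here are hypothetical, since the original schemas aren't shown):

import org.apache.spark.sql.types.{DoubleType, IntegerType}

// Hypothetical starting point: "id" and "amount" were read in as strings.
// df.printSchema (before):
//  |-- id: string (nullable = true)
//  |-- amount: string (nullable = true)

// withColumn with an existing column name drops the old column and puts a
// new one with the same values cast to the new datatype in its place.
val converted = df
  .withColumn("id", df("id").cast(IntegerType))
  .withColumn("amount", df("amount").cast(DoubleType))

// converted.printSchema (after):
//  |-- id: integer (nullable = true)
//  |-- amount: double (nullable = true)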
So this only really works if you're having issues saving to a JDBC driver like SQL Server, but it's really helpful for the errors you will run into with syntax and types.
Generate a simple dataset containing five values and convert int to string type:
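A minimal sketch, assuming a Spark 2+ SparkSession named spark (spark.range(5) produces five rows with a numeric id column):

import org.apache.spark.sql.functions.col

// Five numeric values in column "id", then cast that column to string.
val stringDf = spark.range(5).select(col("id").cast("string").as("id"))
stringDf.printSchema()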
You can use the code below, which will convert the year column to an IntegerType column.
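A sketch of that, using withColumn and cast on the question's df:

import org.apache.spark.sql.types.IntegerType

// Replace the string "year" column with the same values cast to integer.
val dfConverted = df.withColumn("year", df("year").cast(IntegerType))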