How to change column types in Spark SQL's DataFrame

Posted 2019-01-03 11:36

Suppose I'm doing something like:

val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.printSchema()

root
 |-- year: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- comment: string (nullable = true)
 |-- blank: string (nullable = true)

df.show()
year make  model comment              blank
2012 Tesla S     No comment                
1997 Ford  E350  Go get one now th...  

but I really wanted the year as Int (and perhaps transform some other columns).

The best I could come up with is

df.withColumn("year2", 'year.cast("Int")).select('year2 as 'year, 'make, 'model, 'comment, 'blank)
org.apache.spark.sql.DataFrame = [year: int, make: string, model: string, comment: string, blank: string]

which is a bit convoluted.

I'm coming from R, and I'm used to being able to write, e.g.

df2 <- df %>%
   mutate(year = year %>% as.integer, 
          make = make %>% toupper)

I'm likely missing something, since there should be a better way to do this in Spark/Scala...

16 answers
家丑人穷心不美
#2 · 2019-01-03 12:12

You can change the data type of a column by using cast in Spark SQL. Say the table name is table, it has only two columns, column1 and column2, and the type of column1 is to be changed:

spark.sql("select cast(column1 as Double) column1NewName, column2 from table")

In place of Double, write whatever data type you need.
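
If the DataFrame has not been registered for SQL yet, here is a minimal sketch applied to the question's data (assuming a Spark 2.x session; the view name cars is just an illustrative choice):

// Register the DataFrame as a temporary view so it can be queried with SQL (Spark 2.x API)
df.createOrReplaceTempView("cars")

// Cast year to int while selecting; the other columns keep their original string types
val df2 = spark.sql("select cast(year as int) as year, make, model, comment, blank from cars")
df2.printSchema()  // year should now show up as integer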

啃猪蹄的小仙女
#3 · 2019-01-03 12:13

For the answers suggesting cast: FYI, the cast method in Spark 1.4.1 is broken.

For example, a DataFrame with a string column holding the value "8182175552014127960" ends up with the value "8182175552014128100" when cast to bigint:

    df.show
+-------------------+
|                  a|
+-------------------+
|8182175552014127960|
+-------------------+

    df.selectExpr("cast(a as bigint) a").show
+-------------------+
|                  a|
+-------------------+
|8182175552014128100|
+-------------------+

We ran into a lot of issues before finding this bug, because we had bigint columns in production.
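
A plausible explanation (my assumption, not stated in the thread) is that the string gets parsed as a Double along the way; a Double's 53-bit mantissa cannot represent integers of this magnitude exactly, which a plain Scala snippet illustrates:

// Integers above 2^53 cannot be represented exactly as a Double
val s = "8182175552014127960"
println(s.toDouble.toLong)  // prints a nearby value, not the original one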

Animai°情兽
#4 · 2019-01-03 12:14

You can use selectExpr to make it a little cleaner:

df.selectExpr("cast(year as int) as year", "upper(make) as make",
    "model", "comment", "blank")
甜甜的少女心
#5 · 2019-01-03 12:17
    import org.apache.spark.sql.types.{StructType, StringType, FloatType}

    // Cast the relevant field while selecting, then convert to an RDD of Rows
    val fact_df = df.select($"data"(30) as "TopicTypeId", $"data"(31) as "TopicId",
        $"data"(21).cast(FloatType).as("Data_Value_Std_Err")).rdd
    // Schema to be applied to the new DataFrame
    val fact_schema = (new StructType).add("TopicTypeId", StringType).add("TopicId", StringType).add("Data_Value_Std_Err", FloatType)

    val fact_table = sqlContext.createDataFrame(fact_df, fact_schema).dropDuplicates()
成全新的幸福
#6 · 2019-01-03 12:18

First, if you want to cast the type:

import org.apache.spark.sql
df.withColumn("year", $"year".cast(sql.types.IntegerType))

With the same column name, the column will be replaced with the new one; you don't need to add a new column and delete the old one.
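
If several columns need converting (as the question hints), a small sketch of the same idea folded over a map from column name to target type (the typeMap name and its contents are just illustrative):

import org.apache.spark.sql.types.{DataType, IntegerType}

// Illustrative mapping; list every column you actually need to convert
val typeMap: Map[String, DataType] = Map("year" -> IntegerType)

val dfCast = typeMap.foldLeft(df) { case (acc, (colName, dataType)) =>
  acc.withColumn(colName, acc(colName).cast(dataType))
}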

Second, about Scala vs. R: the Scala code most similar to the R version that I can come up with is:

import org.apache.spark.sql.functions
import org.apache.spark.sql.types.IntegerType

val df2 = df.select(
   df.columns.map {
     case year @ "year" => df(year).cast(IntegerType).as(year)
     case make @ "make" => functions.upper(df(make)).as(make)
     case other         => df(other)
   }: _*
)

It is a little longer than the R version. Note that mutate is a function of the R data frame, so Scala's expressive power is quite good here even without using a special function.

(df.columns is, surprisingly, an Array[String] instead of an Array[Column]; maybe they wanted it to look like Python pandas's DataFrame.)

疯言疯语
#7 · 2019-01-03 12:20
df.select($"long_col".cast(IntegerType).as("int_col"))