compute string length in Spark SQL DSL

2019-06-22 14:03发布

Edit: this is an old question concerning Spark 1.2

I've been trying to compute on the fly the length of a string column in a SchemaRDD for orderBy purposes. I am learning Spark SQL so my question is strictly about using the DSL or the SQL interface that Spark SQL exposes, or to know their limitations.

My first attempt has been to use the integrated relational queries, for instance

notes.select('note).orderBy(length('note))

with no luck at the compilation:

error: not found: value length

(Which makes me wonder where to find what "Expression" this DSL can actually resolve. For instance, it resolves "+" for column additions.)

Then I tried

sql("SELECT note, length(note) as len FROM notes")

This fails with

java.util.NoSuchElementException: key not found: length

(Then I reread this (I'm running 1.2.0) http://spark.apache.org/docs/1.2.0/sql-programming-guide.html#supported-hive-features and wonder in what sense Spark SQL supports the listed hive features.)

Questions: is the length operator really supported in Expressions and/or in SQL statements? If yes, what is the syntax? (bonus: is there a specific documentation about what is resolved in Spark SQL Expressions, and what would be the syntax in general?)

Thanks!

2条回答
做自己的国王
2楼-- · 2019-06-22 14:27

It now exists!

Your spark.sql("SELECT note, LENGTH(note) as len FROM notes") should work.

I'm running Spark 2.2.0, just did it and it worked.

查看更多
冷血范
3楼-- · 2019-06-22 14:28

Try this in Spark Shell:

case class Note(id:Int,text:String)
val notes=List(Note(1,"One"),Note(2,"Two"),Note(3,"Three"))
val notesRdd=sc.parallelize(notes)
import org.apache.spark.sql.hive.HiveContext
val hc=new HiveContext(sc)
import hc.createSchemaRDD
notesRdd.registerTempTable("note")
hc.sql("select id, text, length(text) from note").foreach(println)

It works on by setup (out of the box spark 1.2.1 with hadoop 2.4):

[2,Two,3]
[1,One,3]
[3,Three,5]
查看更多
登录 后发表回答