Is there a Spark built-in that flattens nested arrays?

Posted 2019-08-25 07:06

I have a DataFrame field that is a Seq[Seq[String]]. I built a UDF to transform that column into a column of Seq[String]; basically, a UDF for Scala's flatten function.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

def combineSentences(inCol: String, outCol: String): DataFrame => DataFrame = {
    // NULL-safe flatten: a NULL column value becomes an empty Seq
    def flatfunc(seqOfSeq: Seq[Seq[String]]): Seq[String] = seqOfSeq match {
        case null => Seq.empty[String]
        case _ => seqOfSeq.flatten
    }
    df: DataFrame => df.withColumn(outCol, udf(flatfunc _).apply(col(inCol)))
}

My use case is strings, but obviously, this could be generic. You can use this function in a chain of DataFrame transforms like:

df.transform(combineSentences(inCol, outCol))

Is there a Spark built-in function that does the same thing? I have not been able to find one.

1 Answer

混吃等死 · 2019-08-25 07:56

There is a similar function (since Spark 2.4) and it is called flatten:

import org.apache.spark.sql.functions.flatten

From the official documentation:

def flatten(e: Column): Column

Creates a single array from an array of arrays. If a structure of nested arrays is deeper than two levels, only one level of nesting is removed.

Since: 2.4.0
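The one-level behaviour mirrors Scala's own collection flatten; a plain-Scala illustration (not Spark itself, just the analogous semantics):

```scala
// flatten removes exactly one level of nesting, however deep the structure is
val nested: Seq[Seq[Seq[String]]] = Seq(Seq(Seq("a"), Seq("b")), Seq(Seq("c")))
val once: Seq[Seq[String]] = nested.flatten
// one level of nesting remains: Seq(Seq("a"), Seq("b"), Seq("c"))
```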

To get an exact equivalent of your UDF you'll have to use coalesce to replace NULL with an empty array, since flatten returns NULL for NULL input while your UDF returns an empty Seq.
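A minimal sketch of that NULL-safe equivalent (the function name combineSentencesBuiltin is illustrative; it assumes the same inCol/outCol convention as the question):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, flatten, typedLit}

// Same transform signature as the UDF version, but using the built-in
// flatten; coalesce maps a NULL input array to an empty array, matching
// the UDF's `case null => Seq.empty[String]` branch.
def combineSentencesBuiltin(inCol: String, outCol: String): DataFrame => DataFrame =
  df => df.withColumn(
    outCol,
    coalesce(flatten(col(inCol)), typedLit(Seq.empty[String]))
  )
```

You can then chain it exactly as before: df.transform(combineSentencesBuiltin(inCol, outCol)).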
