Is there a Spark built-in that flattens nested arrays?

Posted 2019-08-25 07:06

I have a DataFrame field that is a Seq[Seq[String]]. I built a UDF to transform that column into a column of Seq[String]; basically, a UDF for Scala's flatten function.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

def combineSentences(inCol: String, outCol: String): DataFrame => DataFrame = {
    // NULL-safe flatten: a NULL column value becomes an empty Seq
    def flatfunc(seqOfSeq: Seq[Seq[String]]): Seq[String] = seqOfSeq match {
        case null => Seq.empty[String]
        case _ => seqOfSeq.flatten
    }
    df: DataFrame => df.withColumn(outCol, udf(flatfunc _).apply(col(inCol)))
}

My use case is strings, but obviously, this could be generic. You can use this function in a chain of DataFrame transforms like:

df.transform(combineSentences(inCol, outCol))

Is there a Spark built-in function that does the same thing? I have not been able to find one.

1 Answer

混吃等死 · 2019-08-25 07:56

There is a similar function (since Spark 2.4) and it is called flatten:

import org.apache.spark.sql.functions.flatten

From the official documentation:

def flatten(e: Column): Column

Creates a single array from an array of arrays. If a structure of nested arrays is deeper than two levels, only one level of nesting is removed.

Since: 2.4.0
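The one-level behaviour mirrors Scala's own collection flatten; a plain-Scala illustration (not Spark itself, just the analogous semantics):

```scala
// flatten removes exactly one level of nesting, however deep the structure is
val nested: Seq[Seq[Seq[String]]] = Seq(Seq(Seq("a"), Seq("b")), Seq(Seq("c")))
val once: Seq[Seq[String]] = nested.flatten
// one level of nesting remains: Seq(Seq("a"), Seq("b"), Seq("c"))
```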

To get an exact equivalent of your UDF you'll have to use coalesce to replace NULL with an empty array, since flatten returns NULL for NULL input while your UDF returns an empty Seq.
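A minimal sketch of that NULL-safe equivalent (the function name combineSentencesBuiltin is illustrative; it assumes the same inCol/outCol convention as the question):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, flatten, typedLit}

// Same transform signature as the UDF version, but using the built-in
// flatten; coalesce maps a NULL input array to an empty array, matching
// the UDF's `case null => Seq.empty[String]` branch.
def combineSentencesBuiltin(inCol: String, outCol: String): DataFrame => DataFrame =
  df => df.withColumn(
    outCol,
    coalesce(flatten(col(inCol)), typedLit(Seq.empty[String]))
  )
```

You can then chain it exactly as before: df.transform(combineSentencesBuiltin(inCol, outCol)).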
