How to pass column names in selectExpr through one

2020-04-20 12:06发布

问题:

I am using script for CDC Merge in spark streaming. I wish to pass column values in selectExpr through a parameter as column names for each table would change. When I pass the columns and struct field through a string variable, I am getting error as ==> mismatched input ',' expecting

Below is the piece of code I am trying to parameterize.

var filteredMicroBatchDF=microBatchOutputDF
.selectExpr("col1","col2","struct(offset,KAFKA_TS) as otherCols" )
.groupBy("col1","col2").agg(max("otherCols").as("latest"))
.selectExpr("col1","col2","latest.*")

Reference to the script I am trying to emulate: - https://docs.databricks.com/_static/notebooks/merge-in-cdc.html

I have tried like below by passing column names in a variable and then reading in the selectExpr from these variables: -

val keyCols = "col1","col2"
val structCols = "struct(offset,KAFKA_TS) as otherCols" 

var filteredMicroBatchDF=microBatchOutputDF
.selectExpr(keyCols,structCols )
.groupBy(keyCols).agg(max("otherCols").as("latest"))
.selectExpr(keyCols,"latest.*")

When I run the script it gives me error as org.apache.spark.sql.streaming.StreamingQueryException: mismatched input ',' expecting <<EOF>>

EDIT

Here is what I have tried after comments by Luis Miguel which works fine: -

import org.apache.spark.sql.{DataFrame, functions => sqlfun}

def foo(microBatchOutputDF: DataFrame)
       (keyCols: Seq[String], structCols: Seq[String]): DataFrame =
  microBatchOutputDF
    .selectExpr((keyCols ++ structCols) : _*)
    .groupBy(keyCols.head, keyCols.tail : _*).agg(sqlfun.max("otherCols").as("latest"))
    .selectExpr((keyCols :+ "latest.*") : _*)

var keyColumns = Seq("COL1","COL2")
var structColumns = "offset,Kafka_TS"

foo(microBatchOutputDF)(keyCols = Seq(keyColumns:_*), structColumns = Seq("struct("+structColumns+") as otherCols"))

Note: Below results in an error

foo(microBatchOutputDF)(keyCols = Seq(keyColumns), structColumns = Seq("struct("+structColumns+") as otherCols"))

The thing about above working code is that, here keyColumns were hardcoded. So, I tried reading (firstly) from parameter file and (Secondly) from widget which resulted in error and it is here I am looking for advice and suggestions: -

First Method

def loadProperties(url: String):Properties = {
    val properties: Properties = new Properties()
    if (url != null) {
      val source = Source.fromURL(url)
      properties.load(source.bufferedReader())
    }
  return properties
}
var tableProp: Properties = new Properties()
tableProp = loadProperties("dbfs:/Configs/Databricks/Properties/table/Table.properties") 
var keyColumns = Seq(tableProp.getProperty("keyCols"))
var structColumns = tableProp.getProperty("structCols")

keyCols and StructCols are defined in parameter file as: -

keyCols = Col1, Col2 (I also tried assigning these as "Col1","Col2")
StructCols = offset,Kafka_TS

Then finally,

foo(microBatchOutputDF)(keyCols = Seq(keyColumns:_*), structColumns = Seq("struct("+structColumns+") as otherCols"))

The code is throwing the error pointing at first comma (as if its taking the columns field as single argument):
mismatched input ',' expecting <EOF>
== SQL ==
"COL1","COL2""
-----^^^

If I pass just one column in the keyCols property, code is working fine.
E.g. keyCols = Col1

Second Method
Here I tried reading key columns from the widget and its the same error again.

dbutils.widgets.text("prmKeyCols", "","") 
val prmKeyCols = dbutils.widgets.get("prmKeyCols") 
var keyColumns = Seq(prmKeyCols)

The widget is passed in as below
"Col1","Col2"

Then finally,

foo(microBatchOutputDF)(keyCols = Seq(keyColumns:_*), structColumns = Seq("struct("+structColumns+") as otherCols"))

This is also giving same error.

回答1:

Something like this should work:

import org.apache.spark.sql.{DataFrame, functions => sqlfun}

def foo(microBatchOutputDF: DataFrame)
       (keyCols: Seq[String], structCols: Seq[String]): DataFrame =
  microBatchOutputDF
    .selectExpr((keyCols ++ structCols) : _*)
    .groupBy(keyCols.head, keyCols.tail : _*).agg(sqlfun.max("otherCols").as("latest"))
    .selectExpr((keyCols :+ "latest.*") : _*)

Which you can use like:

foo(microBatchOutputDF)(keyCols = Seq("col1", "col2"), structCols = Seq("struct(offset,KAFKA_TS) as otherCols"))