How to remove duplicate columns with their non-nul

2019-08-03 20:50发布

问题:

This question already has an answer here:

  • How to avoid duplicate columns after join? 6 answers

I have a dataframe named "A" with 300+ columns in it and i am trying to join the dataframe named "A" with its incremental data "B" with same schema as "A".

After joining the dataframes, i am getting duplicate columns. That i was avoiding by using co

val toPrint = udf((value1: String, value2: String) => if(value1 != null) {value1} else value2)
val dfClean = df1.join(df2, df1("PERIOD_TAG") === df2("PERIOD_TAG"), "fullouter").select(toPrint(df1("PERIOD_SHORT_DESCRIPTION"),df2("PERIOD_SHORT_DESCRIPTION")).alias("PERIOD_SHORT_DESCRIPTION"),toPrint(df1("PERIOD_TAG"),df2("PERIOD_TAG")).alias("PERIOD_TAG"))....so on for all the columns

I am calling a UDF to select the most updated value(from incremental file) among the duplicate columns. The incremental data will have few updated data which i need to add along with all new data in incremantal dataframe and also old data of dataframe "B".

Is there any another way to avoid selecting columns individually and use a for loop for it. Or is there any way that after joining, i get the new/updated value of my incremental df and old values of dataframe "B" which are not present in dataframe "A".

回答1:

I'd first avoid the duplication in join column names using single-string usingColumn argument of join operator.

df1.join(df2, "PERIOD_TAG", "fullouter")

That takes care of de-duplicating PERIOD_TAG column.

Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

The last step is to use coalesce function:

coalesce(e: Column*): Column Returns the first column that is not null, or null if all inputs are null.

That looks like your case exactly and avoids dealing with 300+ columns.

val myCol = coalesce($"df1.one", $"df2.one") as "one"
df1.join(df2, "PERIOD_TAG", "inner").
  select(myCol).
  show

So, the exercise is to build myCol-like sequence of columns using coalesce function for every column in the schema (which looks like a fairly easy programming assignment :))