Dropping multiple columns from a Spark DataFrame

Published 2019-04-02 01:44

Question:

I have a DataFrame with around 400 columns, and I need to drop 100 of them. I have created a Scala List of the 100 column names, and I want to iterate over it with a for loop, dropping one column per iteration.

Below is the code.

final val dropList: List[String] = List("Col1", "Col2", ...., "Col100")

def drpColsfunc(inputDF: DataFrame): DataFrame = { 
    for (i <- 0 to dropList.length - 1) {
        val returnDF = inputDF.drop(dropList(i))
    }
    return returnDF
}

val test_df = drpColsfunc(input_dataframe) 

test_df.show(5)

Answer 1:


val colsToRemove = Seq("colA", "colB", "colC") // ... add the remaining column names here

import org.apache.spark.sql.Column

val filteredDF = df.select(
  df.columns
    .filter(colName => !colsToRemove.contains(colName))
    .map(colName => new Column(colName)): _*
)
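
An equivalent, slightly more compact variant uses the col helper from org.apache.spark.sql.functions instead of constructing Column objects directly. This is only a sketch against the same assumed df and colsToRemove:

import org.apache.spark.sql.functions.col

// Keep every column whose name is not in colsToRemove.
val keptColumns = df.columns.filterNot(colsToRemove.contains)
val filteredDF2 = df.select(keptColumns.map(col): _*)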


Answer 2:

If you want nothing more complex than dropping several named columns, as opposed to selecting them by some condition, you can simply do the following:

df.drop("colA", "colB", "colC")
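
If the names are already collected in a Seq, as in the question, the same call works through varargs expansion, since drop accepts multiple column names. A minimal sketch, assuming df and the colsToRemove Seq from the first answer:

// Expand the Seq of names into drop's varargs parameter.
val slimDF = df.drop(colsToRemove: _*)
slimDF.printSchema()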


Answer 3:

You can just do:

def dropColumns(inputDF: DataFrame, dropList: List[String]): DataFrame = 
    dropList.foldLeft(inputDF)((df, col) => df.drop(col))

It returns the DataFrame without the columns listed in dropList.
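
Used with the names from the question (assuming input_dataframe and dropList are defined as in the original post), the call would look something like this:

// Drop every column named in dropList, one fold step per name.
val test_df = dropColumns(input_dataframe, dropList)
test_df.show(5)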

As an example (of what's happening behind the scenes), let me put it this way.

scala> val list = List(0, 1, 2, 3, 4, 5, 6, 7)
list: List[Int] = List(0, 1, 2, 3, 4, 5, 6, 7)

scala> val removeThese = List(0, 2, 3)
removeThese: List[Int] = List(0, 2, 3)

scala> removeThese.foldLeft(list)((l, r) => l.filterNot(_ == r))
res2: List[Int] = List(1, 4, 5, 6, 7)

The returned list (in our case, think of it as your DataFrame) is the result of the last filter. After each fold step, the intermediate result is passed on to the next application of the function (l, r) => l.filterNot(_ == r).