I have a dataframe which has columns around 400, I want to drop 100 columns as per my requirement.
So i have created a Scala List of 100 column names.
And then i want to iterate through a for loop to actually drop the column in each for loop iteration.
Below is the code.
final val dropList: List[String] = List("Col1","Col2",...."Col100”)
def drpColsfunc(inputDF: DataFrame): DataFrame = {
for (i <- 0 to dropList.length - 1) {
val returnDF = inputDF.drop(dropList(i))
}
return returnDF
}
val test_df = drpColsfunc(input_dataframe)
test_df.show(5)
Answer:
val colsToRemove = Seq("colA", "colB", "colC", etc)
val filteredDF = df.select(df.columns .filter(colName => !colsToRemove.contains(colName)) .map(colName => new Column(colName)): _*)
If you just want to do nothing more complex than dropping several named columns, as opposed to selecting them by a particular condition, you can simply do the following:
df.drop("colA", "colB", "colC")
You can just do,
def dropColumns(inputDF: DataFrame, dropList: List[String]): DataFrame =
dropList.foldLeft(inputDF)((df, col) => df.drop(col))
It will return you the DataFrame
without the columns passed in dropList
.
As an example (of what's happening behind the scene), let me put it this way.
scala> val list = List(0, 1, 2, 3, 4, 5, 6, 7)
list: List[Int] = List(0, 1, 2, 3, 4, 5, 6, 7)
scala> val removeThese = List(0, 2, 3)
removeThese: List[Int] = List(0, 2, 3)
scala> removeThese.foldLeft(list)((l, r) => l.filterNot(_ == r))
res2: List[Int] = List(1, 4, 5, 6, 7)
The returned list (in our case, map it to your DataFrame) is the latest filtered. After each fold, the latest is passed to the next function (_, _) => _
.