I found that PySpark has a method called drop, but it seems it can only drop one column at a time. Any ideas about how to drop multiple columns at the same time?
df.drop(['col1','col2'])
TypeError Traceback (most recent call last)
<ipython-input-96-653b0465e457> in <module>()
----> 1 selectedMachineView = machineView.drop([['GpuName','GPU1_TwoPartHwID']])
/usr/hdp/current/spark-client/python/pyspark/sql/dataframe.pyc in drop(self, col)
1257 jdf = self._jdf.drop(col._jc)
1258 else:
-> 1259 raise TypeError("col should be a string or a Column")
1260 return DataFrame(jdf, self.sql_ctx)
1261
TypeError: col should be a string or a Column
Simply with select:
df.select([c for c in df.columns if c not in {'GpuName','GPU1_TwoPartHwID'}])
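The filtering step in the comprehension above can be pulled out into a small helper, which makes it easy to check without a Spark session. This is only a minimal sketch: `columns_to_keep` is a hypothetical name, not part of PySpark, the column names `MachineId` and `Os` are made up for illustration, and it assumes column names are plain strings as returned by `df.columns`.

```python
def columns_to_keep(all_columns, to_drop):
    """Return the names in all_columns that are not in to_drop, preserving order."""
    excluded = set(to_drop)  # set gives fast membership checks
    return [c for c in all_columns if c not in excluded]


# A plain list stands in for df.columns here (hypothetical column names).
cols = ["MachineId", "GpuName", "GPU1_TwoPartHwID", "Os"]
kept = columns_to_keep(cols, ["GpuName", "GPU1_TwoPartHwID"])
print(kept)
```

With a real DataFrame this would presumably be used as `df.select(columns_to_keep(df.columns, ['GpuName', 'GPU1_TwoPartHwID']))`. Names that are not present in the frame are simply ignored.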
or, if you really want to use drop, then reduce should do the trick:
from functools import reduce
from pyspark.sql import DataFrame
reduce(DataFrame.drop, ['GpuName','GPU1_TwoPartHwID'], df)
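To see what the reduce call expands to, it can help to run it against a stand-in object. The sketch below uses a hypothetical `FakeFrame` class (not a PySpark type) whose `drop` accepts a single column name and returns a new object, mimicking the single-column signature shown in the traceback. It illustrates that reduce issues one chained `drop` call per column.

```python
from functools import reduce


class FakeFrame:
    """Hypothetical stand-in for a DataFrame whose drop() accepts one column."""

    calls = 0  # counts drop() invocations across all instances

    def __init__(self, columns):
        self.columns = list(columns)

    def drop(self, col):
        FakeFrame.calls += 1
        return FakeFrame(c for c in self.columns if c != col)


df = FakeFrame(["MachineId", "GpuName", "GPU1_TwoPartHwID"])

# Equivalent to df.drop("GpuName").drop("GPU1_TwoPartHwID")
result = reduce(FakeFrame.drop, ["GpuName", "GPU1_TwoPartHwID"], df)
print(result.columns, FakeFrame.calls)
```

The call counter is there only to make the per-column calls visible; it says nothing about how Spark itself executes the resulting plan.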
Note (difference in execution time):
There should be no difference when it comes to data processing time. While these methods generate different logical plans, the physical plans are exactly the same.
There is a difference, however, when we analyze the driver-side code:
- the first method makes only a single JVM call, while the second one has to call the JVM once for each column that has to be excluded
- the first method generates a logical plan which is equivalent to the physical plan. In the second case it is rewritten.
- finally, comprehensions are significantly faster in Python than methods like map or reduce
- Spark 2.x+ supports multiple columns in drop. See SPARK-11884 (Drop multiple columns in the DataFrame API) and SPARK-12204 (Implement drop method for DataFrame in SparkR) for details.
In PySpark 2.1.0, the drop method supports multiple columns:
PySpark 2.0.2:
DataFrame.drop(col)
PySpark 2.1.0:
DataFrame.drop(*cols)
Example:
df.drop('col1', 'col2')
The right way to do this is:
df.drop(*['col1', 'col2', 'col3'])
The * needs to come outside of the brackets if there are multiple columns to drop, so that the list is unpacked into separate arguments.
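The difference between the two call forms comes from Python's argument unpacking rather than anything specific to Spark. Here is a minimal sketch using a hypothetical function `drop_names`, which only mimics a `*cols` signature and is not PySpark's implementation (its error message is also made up):

```python
def drop_names(*cols):
    """Hypothetical stand-in mimicking a drop(*cols) signature; accepts only strings."""
    for col in cols:
        if not isinstance(col, str):
            raise TypeError("each col should be a string")
    return cols


to_drop = ["col1", "col2", "col3"]

# Unpacked: the function receives three separate string arguments.
unpacked = drop_names(*to_drop)

# Not unpacked: the function receives one argument, the list itself.
try:
    drop_names(to_drop)
    list_rejected = False
except TypeError:
    list_rejected = True

print(unpacked, list_rejected)
```

Under these assumptions, `drop_names(*to_drop)` behaves the same as writing `drop_names('col1', 'col2', 'col3')` by hand, which is why the list form is convenient when the column names are built at runtime.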