I need a PySpark solution for pandas drop_duplicates(keep=False). Unfortunately, the keep=False option is not available in PySpark.
Pandas Example:
import pandas as pd
df_data = {'A': ['foo', 'foo', 'bar'],
           'B': [3, 3, 5],
           'C': ['one', 'two', 'three']}
df = pd.DataFrame(data=df_data)
df = df.drop_duplicates(subset=['A', 'B'], keep=False)
print(df)
Expected output:
     A  B      C
2  bar  5  three
A conversion via .to_pandas() and back to PySpark is not an option.
Thanks!
Use a window function to count the number of rows for each A / B combination, then filter the result to keep only the rows whose combination occurs exactly once. Another option is to use pandas_udf.