Remove all rows that are duplicates with respect to some columns

Posted 2019-08-07 08:24

Question:

I've seen a couple questions like this but not a satisfactory answer for my situation. Here is a sample DataFrame:

+------+-----+----+
|    id|value|type|
+------+-----+----+
|283924|  1.5|   0|
|283924|  1.5|   1|
|982384|  3.0|   0|
|982384|  3.0|   1|
|892383|  2.0|   0|
|892383|  2.5|   1|
+------+-----+----+

I want to identify duplicates by just the "id" and "value" columns, and then remove all instances.

In this case:

  • Rows 1 and 2 are duplicates (again, we are ignoring the "type" column)
  • Rows 3 and 4 are duplicates, so only rows 5 and 6 should remain.

The output would be:

+------+-----+----+
|    id|value|type|
+------+-----+----+
|892383|  2.5|   1|
|892383|  2.0|   0|
+------+-----+----+

I've tried

df.dropDuplicates(subset = ['id', 'value'], keep = False)

But the "keep" feature isn't in PySpark (as it is in pandas.DataFrame.drop_duplicates.

How else could I do this?
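
For reference, this is the behaviour I'm after, shown on a hypothetical pandas copy of the same data:

import pandas as pd

# Hypothetical pandas copy of the sample data above.
pdf = pd.DataFrame({
    "id":    [283924, 283924, 982384, 982384, 892383, 892383],
    "value": [1.5, 1.5, 3.0, 3.0, 2.0, 2.5],
    "type":  [0, 1, 0, 1, 0, 1],
})

# keep=False drops every row that has a duplicate on ["id", "value"],
# leaving only the two 892383 rows.
print(pdf.drop_duplicates(subset=["id", "value"], keep=False))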

Answer 1:

You can do that using window functions:

from pyspark.sql import Window, functions as F

# Count the rows in each (id, value) partition and keep only the
# pairs that occur exactly once.
df.withColumn(
  'fg',
  F.count("id").over(Window.partitionBy("id", "value"))
).where("fg = 1").drop("fg").show()
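
A minimal, self-contained sketch of the same approach, assuming an active SparkSession named spark and recreating the sample data from the question:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the sample DataFrame from the question.
df = spark.createDataFrame(
    [(283924, 1.5, 0), (283924, 1.5, 1),
     (982384, 3.0, 0), (982384, 3.0, 1),
     (892383, 2.0, 0), (892383, 2.5, 1)],
    ["id", "value", "type"],
)

# Count the rows in each (id, value) partition; only rows whose pair
# occurs exactly once survive the filter.
w = Window.partitionBy("id", "value")
df.withColumn("fg", F.count("id").over(w)).where("fg = 1").drop("fg").show()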


Answer 2:

You can groupBy the id and value columns to get the count. Then use join to filter out the rows in your DataFrame where the count is not 1:

# Inner join keeps only rows whose (id, value) pair occurs exactly once.
df.join(
    df.groupBy('id', 'value').count().where('count = 1').drop('count'),
    on=['id', 'value']
).show()
#+------+-----+----+
#|    id|value|type|
#+------+-----+----+
#|892383|  2.5|   1|
#|892383|  2.0|   0|
#+------+-----+----+