Original data:
cls, id
a, 1
a, 1
b, 3
b, 3
b, 4
Expected output:
cls, id
a, 1
b, 3
b, 4
An id can be duplicated only within the same cls. In other words, the same id never appears under two different cls values.
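In that case, dropDuplicates will shuffle the data across all partitions to check for duplicates across every cls, and the result is repartitioned into 200 partitions (the default value of spark.sql.shuffle.partitions).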
Now, how can I run dropDuplicates on each partition separately to reduce the computing cost?
Something like:
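For illustration, a per-partition deduplication could be sketched in PySpark roughly as follows. This is only a sketch under stated assumptions: `dedup_partition` and `drop_duplicates_per_partition` are hypothetical helper names (not Spark APIs), `spark` is an existing SparkSession, and the data is assumed to be laid out so that all rows with the same cls already sit in the same partition. If that last assumption does not hold, the result will differ from a global dropDuplicates.

```python
def dedup_partition(rows):
    """Yield each distinct row of a single partition once, in first-seen order."""
    seen = set()
    for row in rows:
        # A Spark Row is a tuple subclass; this assumes column values are hashable.
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            yield row


def drop_duplicates_per_partition(spark, df):
    """Hypothetical helper: remove duplicates inside each partition, without a shuffle.

    Assumption: rows sharing a cls are already co-located in one partition.
    Otherwise duplicates that straddle partitions will survive.
    """
    deduped = df.rdd.mapPartitions(dedup_partition)
    # Rebuild a DataFrame with the original schema; partitioning is left unchanged.
    return spark.createDataFrame(deduped, schema=df.schema)
```

Because `mapPartitions` works on each partition in place, this avoids the shuffle and keeps the existing number of partitions, at the cost of holding the distinct rows of one partition in memory while it is processed.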