I have a dateframe which have unique as well as repeated records on the basis of number. Now i want to split the dataframe into two dataframe. In first dataframe i need to copy only unique rows and in second dataframe i want all repeated rows. For example
id name number
1 Shan 101
2 Shan 101
3 John 102
4 Michel 103
The two splitted dataframe should be like
Unique
id name number
3 John 102
4 Michel 103
Repeated
id name number
1 Shan 101
2 Shan 101
The solution you tried could probably get you there.
Your data looks like this
val df = sc.parallelize(Array(
(1, "Shan", 101),
(2, "Shan", 101),
(3, "John", 102),
(4, "Michel", 103)
)).toDF("id","name","number")
Then you yourself suggest grouping and counting. If you do it like this
val repeatedNames = df.groupBy("name").count.where(col("count")>1).withColumnRenamed("name","repeated").drop("count")
then you could actually get all the way by doing something like this afterwards:
val repeated = df.join(repeatedNames, repeatedNames("repeated")===df("name")).drop("repeated")
val distinct = df.except(repeated)
repeated show
+---+----+------+
| id|name|number|
+---+----+------+
| 1|Shan| 101|
| 2|Shan| 101|
+---+----+------+
distinct show
+---+------+------+
| id| name|number|
+---+------+------+
| 4|Michel| 103|
| 3| John| 102|
+---+------+------+
Hope it helps.