spark dropDuplicates based on json array field

Posted 2019-03-03 14:21

I have json files of the following structure:

{"names":[{"name":"John","lastName":"Doe"},
{"name":"John","lastName":"Marcus"},
{"name":"David","lastName":"Luis"}
]}

I want to read several such JSON files and deduplicate them based on the "name" field inside names. I tried

df.dropDuplicates(Array("names.name")) 

but it didn't do the magic.
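For reference, I read the files like this (the path is just a placeholder):

val df = spark.read.json("/data/names/*.json")
df.printSchema()
// root
//  |-- names: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- lastName: string (nullable = true)
//  |    |    |-- name: string (nullable = true)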

2 Answers
对你真心纯属浪费
#2 · 2019-03-03 14:44

just for future reference, the solution looks like

import org.apache.spark.sql.functions.{col, explode}

// flatten the nested array into a helper column, dedup on it, then drop it
val uniqueNames = allNames
  .withColumn("DEDUP_NAME_KEY", explode(col("names.name")))
  .cache()
  .dropDuplicates("DEDUP_NAME_KEY")
  .drop("DEDUP_NAME_KEY")
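explode emits one row per element of the names.name array, so dropDuplicates can work on a flat string column and one row survives per distinct name; the cache() avoids recomputing the explode if the result is evaluated more than once.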
做自己的国王
#3 · 2019-03-03 14:52

This seems to be a regression introduced in Spark 2.0. If you bring the nested column up to the top level, you can drop the duplicates: create a new column from the columns you want to dedup on, drop duplicates on that key, and then drop the helper column. The following works for composite keys as well.

import org.apache.spark.sql.functions.{col, concat_ws}

val columns = Seq("names.name")
df.withColumn("DEDUP_KEY", concat_ws(",", columns.map(col): _*))
  .dropDuplicates("DEDUP_KEY")
  .drop("DEDUP_KEY")
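Note that names.name is an array column, so concat_ws joins all the names in a row into one comma-separated string; this variant therefore keeps one row per distinct list of names rather than one per distinct name. Wrapped as a reusable function (the name dedupOn is just for illustration), the same idea handles composite keys:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat_ws}

// builds a composite string key from the given columns, drops duplicate
// rows on it, then removes the helper column
def dedupOn(df: DataFrame, keys: Seq[String]): DataFrame =
  df.withColumn("DEDUP_KEY", concat_ws(",", keys.map(col): _*))
    .dropDuplicates("DEDUP_KEY")
    .drop("DEDUP_KEY")

dedupOn(df, Seq("names.name"))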