I am getting error while doing outer joins on 2 data frames. I am trying to get the percentile.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.jsonFile("temp.txt")
val res = df.withColumn("visited", explode($"visited"))
val result1 =res.groupBy($"customerId", $"visited.placeName").agg(count("*").alias("total"))
val result2 = res
.filter($"visited.rating" < 4)
.groupBy($"customerId", $"visited.placeName")
.agg(count("*").alias("top"))
result1.show()
result2.show()
val temp = result1.join(result2, List("placeName","customerId"), "outer")
temp.registerTempTable("percentile")
sqlContext.sql("select top/total as Percentage from percentile groupBy placeName")
The Error I am getting :
<console>:43: error: type mismatch; found : List[String] required: org.apache.spark.sql.Column –
I am using spark 1.5
Can anyone tell me what I am doing wrong here ? I tried this as well :
val temp = result1.join(result2, Seq("placeName","customerId"), "outer")
But still getting :
found : Seq[String] required: org.apache.spark.sql.Column
With this :
val res = df.withColumn("visited", explode($"visited"))
val file1 =res.groupBy( $"visited.placeName").agg(count("*").alias("total"))
val file2 = res.filter($"visited.rating" > 3).groupBy($"visited.placeName").agg(count("*").alias("top"))
file1.show()
file2.show()
file1.join(file2)
I am getting duplicate columns.
My Schema :
{
"country": "France",
"customerId": "France001",
"visited": [
{
"placeName": "US",
"rating": "2",
"famousRest": "N/A",
"placeId": "AVBS34"
},
{
"placeName": "US",
"rating": "3",
"famousRest": "SeriousPie",
"placeId": "VBSs34"
},
{
"placeName": "Canada",
"rating": "3",
"famousRest": "TimHortons",
"placeId": "AVBv4d"
}
]
}
US top = 1 count = 3
Canada top = 1 count = 3
{
"country": "Canada",
"customerId": "Canada012",
"visited": [
{
"placeName": "UK",
"rating": "3",
"famousRest": "N/A",
"placeId": "XSdce2"
},
]
}
UK top = 1 count = 1
{
"country": "France",
"customerId": "France001",
"visited": [
{
"placeName": "US",
"rating": "4.3",
"famousRest": "N/A",
"placeId": "AVBS34"
},
{
"placeName": "US",
"rating": "3.3",
"famousRest": "SeriousPie",
"placeId": "VBSs34"
},
{
"placeName": "Canada",
"rating": "4.3",
"famousRest": "TimHortons",
"placeId": "AVBv4d"
}
]
}
US top = 2 count = 3
Canada top = 1 count = 3
So at the end I need something like :
PlaceName Percentage
US 57.14 (1+1+2)/(3+1+3) *100
Canada 33.33 (1+1)/(3+3) *100
UK 100 1*100
Schema:
root
|-- country: string(nullable=true)
|-- customerId:string(nullable=true)
|-- visited: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- placeId: string (nullable = true)
| | |-- placeName: string (nullable = true)
| | |-- famousRest: string (nullable = true)
| | |-- rating: string (nullable = true)