I have a dataset containing the Customer_ID and the movies that each customer has seen.
I am analyzing patterns across the movies, for example: if customer X sees movie Y, then he also sees movie Z.
I have already grouped my dataset by Customer_ID, and here is a sample of the data:
Customer_ID,Movie_ID
1, 2,1,3
2, 1
3, 3,6,8
What I want is to ignore the Customer_ID column and keep only the list of movies, like this:
2,1,3
1
3,6,8
How can I do this? My code is:
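// Needed outside spark-shell, which imports these automatically
import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

val data = sc.textFile("FILE")
case class Movies(Customer_ID: String, Movie_ID: String)
// Parse one CSV line into a Movies(Customer_ID, Movie_ID) record
def csvToMyClass(line: String) = {
  val split = line.split(',')
  Movies(split(0), split(1))
}
val df = data.map(csvToMyClass).toDF("Customer_ID", "Movie_ID")
df.show
// Collect each customer's movies into one comma-separated string
val movies = df.groupBy(col("Customer_ID"))
  .agg(collect_list(col("Movie_ID")) as "Movie_ID")
  .withColumn("Movie_ID", concat_ws(",", col("Movie_ID")))
  .rdd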
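To make the goal concrete, here is a minimal sketch of the result I am after, written with plain Scala collections rather than Spark. It assumes the raw input has one Customer_ID,Movie_ID pair per line; the sample values and the names rawLines, pairs, grouped and moviesOnly are hypothetical and only for illustration. On the grouped DataFrame above, I assume the analogous step would be something like selecting only the Movie_ID column before converting to an RDD.

```scala
// Hypothetical sample input: one "Customer_ID,Movie_ID" pair per line
val rawLines = List("1,2", "1,1", "1,3", "2,1", "3,3", "3,6", "3,8")

// Parse each line into a (customerId, movieId) pair
val pairs = rawLines.map { line =>
  val parts = line.split(',')
  (parts(0).trim, parts(1).trim)
}

// Group the movies by customer, keeping the input order within each customer
val grouped: Map[String, List[String]] =
  pairs.groupBy(_._1).map { case (id, ps) => id -> ps.map(_._2) }

// Drop the customer ID and keep only the comma-joined movie list
// (sorted by customer ID only so the output order is stable)
val moviesOnly: List[String] =
  grouped.toList.sortBy(_._1).map { case (_, movies) => movies.mkString(",") }

moviesOnly.foreach(println)
```

With the sample values above this prints the three lines 2,1,3 then 1 then 3,6,8, which is the shape of output I want from the Spark job.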
Thanks