Spark MLlib - Scala

Posted 2019-09-15 19:56

Question:

I have a dataset containing the Customer_ID and the movies that each customer has seen.

I am analyzing viewing patterns across the movies, for example: if customer X sees movie Y, then he also sees movie Z.

I have already grouped my dataset by Customer_ID, and here is a sample of the data:

Customer_ID,Movie_ID
1,         2,1,3
2,         1
3,         3,6,8

What I want is to ignore the Customer_ID column and keep only the list of movies, like this:

2,1,3
1
3,6,8

How can I do this? My code is:

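    import org.apache.spark.sql.functions.{col, collect_list, concat_ws}
    import spark.implicits._  // needed for toDF; assumes a SparkSession named spark, as in spark-shell

    val data = sc.textFile("FILE")

    case class Movies(Customer_ID: String, Movie_ID: String)

    // Split on the first comma only, so the whole movie list stays in Movie_ID.
    def csvToMyClass(line: String): Movies = {
      val split = line.split(",", 2)
      Movies(split(0).trim, split(1).trim)
    }

    val df = data.map(csvToMyClass).toDF("Customer_ID", "Movie_ID")

    df.show()

    // Collect the movies per customer and join them into one comma-separated string.
    val movies = df
      .groupBy(col("Customer_ID"))
      .agg(collect_list(col("Movie_ID")) as "Movie_ID")
      .withColumn("Movie_ID", concat_ws(",", col("Movie_ID")))
      .rdd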

Thanks
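
For illustration, here is a minimal sketch of the transformation described above (drop the customer column, keep only the movie list), written in plain Scala so it can be tried without a cluster. The helper name `moviesOnly` and the in-memory `lines` sample are assumptions made up for this example; the sample mirrors the data shown in the question, including a header row that may or may not be present in the real file.

```scala
// Hypothetical helper: keep everything after the first comma, trimmed.
// The limit argument 2 stops splitting after the first comma, so the
// remaining movie IDs stay together in one string.
def moviesOnly(line: String): String = {
  val parts = line.split(",", 2)
  if (parts.length == 2) parts(1).trim else ""
}

// In-memory stand-in for the file contents (assumed layout).
val lines = Seq(
  "Customer_ID,Movie_ID",
  "1,         2,1,3",
  "2,         1",
  "3,         3,6,8"
)

// Drop the header row, then strip the customer column from each line.
val movies = lines.filterNot(_.startsWith("Customer_ID")).map(moviesOnly)

movies.foreach(println)
```

Since `sc.textFile` returns an RDD of strings, the same function could presumably be passed to `filter` and `map` on that RDD, for example `data.filter(!_.startsWith("Customer_ID")).map(moviesOnly)`, without going through a DataFrame or a `groupBy` at all.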