I am new to Scala. I have two RDDs and I need to separate my training and testing data. One file holds all the data and the other holds just the testing data. I need to remove the testing data from the complete data set.
The complete data file has the format (userID, MovID, Rating, Timestamp):
res8: Array[String] = Array(1, 31, 2.5, 1260759144)
The test data file has the format (userID, MovID):
res10: Array[String] = Array(1, 1172)
How do I generate ratings_train so that it does not contain the cases that match the testing dataset? I am using the following function, but the returned list is empty:
def create_training(data: RDD[String], ratings_test: RDD[String]): ListBuffer[Array[String]] = {
  val ratings_split = dropheader(data).map(line => line.split(","))
  val ratings_testing = dropheader(ratings_test).map(line => line.split(",")).collect()
  var ratings_train = new ListBuffer[Array[String]]()
  ratings_split.foreach(x => {
    ratings_testing.foreach(y => {
      if (x(0) != y(0) || x(1) != y(1)) {
        ratings_train += x
      }
    })
  })
  return ratings_train
}
EDIT: changed code but running into memory issues.
This may work.
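The list in the question most likely comes back empty because the closure passed to RDD.foreach is serialized and run on the executors, so each task appends to its own copy of ratings_train and the buffer on the driver is never touched. Separately, the inner loop adds x once for every test row that differs from it, so even on a single machine it would produce duplicates rather than a filtered set. A minimal sketch of one alternative, keying both RDDs by (userID, MovID) and using subtractByKey so that nothing has to be collected to the driver, is below. The names TrainingSplit, dropHeader and createTraining are illustrative, and dropHeader is only a hypothetical stand-in for the question's dropheader, whose body is not shown; it assumes the header is the first line of the first partition.

```scala
import org.apache.spark.rdd.RDD

object TrainingSplit {
  // Hypothetical stand-in for the question's dropheader helper (not shown in the post).
  // Assumption: the header is the first line of partition 0, as with sc.textFile on one file.
  def dropHeader(data: RDD[String]): RDD[String] =
    data.mapPartitionsWithIndex { (idx, lines) =>
      if (idx == 0) lines.drop(1) else lines
    }

  // Returns the rows of `data` whose (userID, MovID) pair does not occur in `ratingsTest`.
  // The result stays an RDD, so no driver-side buffer is mutated from inside a closure.
  def createTraining(data: RDD[String], ratingsTest: RDD[String]): RDD[Array[String]] = {
    // Key every full row by (userID, MovID); keep the whole row as the value.
    val ratingsByKey = dropHeader(data)
      .map(_.split(",").map(_.trim))
      .map(fields => ((fields(0), fields(1)), fields))

    // Key every test row the same way; the value is a dummy and is never read.
    val testKeys = dropHeader(ratingsTest)
      .map(_.split(",").map(_.trim))
      .map(fields => ((fields(0), fields(1)), 1))

    // Drop every row whose key appears in the test set, then discard the keys.
    ratingsByKey.subtractByKey(testKeys).values
  }
}
```

Calling TrainingSplit.createTraining(data, ratings_test) gives an RDD[Array[String]]. Call collect() on it only if the training set is known to fit in driver memory. Because the test rows are never collected either, this may also help with the memory issues mentioned in the edit, though that depends on the cluster and the data sizes.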
Using Regex: