First, I have some usage history of the users' apps.
For example:
user1, app1, 3 (launch times)
user2, app2, 2 (launch times)
user3, app1, 1 (launch times)
I basically have two requirements:
- Recommend some apps to every user.
- Recommend similar apps for every app.
So I use the implicit ALS from Spark MLlib to implement it. At first, I just used the original data to train the model, and the result was terrible. I think it may be caused by the range of the launch times, which goes from 1 to several thousand. So I preprocessed the original data into a score, which I think reflects the true situation better and is more evenly scaled:
score = lt / uMlt + lt / aMlt
where:
- score is the processed value used to train the model.
- lt is the launch times in the original data.
- uMlt is the user's mean launch times in the original data: (total launch times of the user) / (number of apps this user has ever launched).
- aMlt is the app's mean launch times in the original data: (total launch times of the app) / (number of users who have ever launched this app).
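Roughly, the preprocessing looks like the sketch below (the RDD name rawUsage and the exact field layout are only for illustration, not my real code):

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.Rating

// rawUsage: one (userId, appId, launchTimes) triple per user-app pair
def toScoredRatings(rawUsage: RDD[(Int, Int, Double)]): RDD[Rating] = {
  // uMlt: total launch times of the user / number of apps the user has launched
  val userMean = rawUsage
    .map { case (u, _, lt) => (u, (lt, 1)) }
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
    .mapValues { case (sum, cnt) => sum / cnt }

  // aMlt: total launch times of the app / number of users who have launched it
  val appMean = rawUsage
    .map { case (_, a, lt) => (a, (lt, 1)) }
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
    .mapValues { case (sum, cnt) => sum / cnt }

  // score = lt / uMlt + lt / aMlt
  rawUsage
    .map { case (u, a, lt) => (u, (a, lt)) }
    .join(userMean)
    .map { case (u, ((a, lt), uMlt)) => (a, (u, lt, uMlt)) }
    .join(appMean)
    .map { case (a, ((u, lt, uMlt), aMlt)) => Rating(u, a, lt / uMlt + lt / aMlt) }
}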
Here are some examples of the data after processing.
Rating(95788,20992,0.14167073369026184)
Rating(98696,20992,5.92363166809082)
Rating(160020,11264,2.261538505554199)
Rating(67904,11264,2.261538505554199)
Rating(268430,11264,0.13846154510974884)
Rating(201369,11264,1.7999999523162842)
Rating(180857,11264,2.2720916271209717)
Rating(217692,11264,1.3692307472229004)
Rating(186274,28672,2.4250855445861816)
Rating(120820,28672,0.4422124922275543)
Rating(221146,28672,1.0074234008789062)
After doing this, and aggregating the apps that have different package names, the result seems better, but it is still not good enough.
I find that the feature values of the users and products are very small, and most of them are negative.
Here are 3 example lines of product features, 10 dimensions per line:
((CompactBuffer(com.youlin.xyzs.shoumeng, com.youlin.xyzs.juhe.shoumeng)),(-4.798973236574966E-7,-7.641608021913271E-7,6.040852440492017E-7,2.82689171626771E-7,-4.255948056197667E-7,1.815822798789668E-7,5.000047167413868E-7,2.0220664964654134E-7,6.386763402588258E-7,-4.289261710255232E-7))
((CompactBuffer(com.dncfcjaobhegbjccdhandkba.huojia)),(-4.769295992446132E-5,-1.7072002810891718E-4,2.1351299074012786E-4,1.6345139010809362E-4,-1.4456869394052774E-4,2.3657752899453044E-4,-4.508546771830879E-5,2.0895185298286378E-4,2.968782791867852E-4,1.9461760530248284E-4))
((CompactBuffer(com.tern.rest.pron)),(-1.219763362314552E-5,-2.8371430744300596E-5,2.9869115678593516E-5,2.0747662347275764E-5,-2.0555471564875916E-5,2.632938776514493E-5,2.934047643066151E-6,2.296348611707799E-5,3.8075613701948896E-5,1.2197584510431625E-5))
Here are 3 example lines of user features, 10 dimensions per line:
(96768,(-0.0010857731103897095,-0.001926362863741815,0.0013726564357057214,6.345533765852451E-4,-9.048808133229613E-4,-4.1544197301846E-5,0.0014421759406104684,-9.77902309386991E-5,0.0010355513077229261,-0.0017878251383081079))
(97280,(-0.0022841691970825195,-0.0017134940717369318,0.001027365098707378,9.437055559828877E-4,-0.0011165080359205604,0.0017137592658400536,9.713359759189188E-4,8.947265450842679E-4,0.0014328152174130082,-5.738904583267868E-4))
(97792,(-0.0017802991205826402,-0.003464450128376484,0.002837196458131075,0.0015725698322057724,-0.0018932095263153315,9.185600210912526E-4,0.0018971719546243548,7.250450435094535E-4,0.0027060359716415405,-0.0017731878906488419))
So you can imagine how small the values get when I take the dot product of the feature vectors to compute the entries of the user-item matrix.
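Concretely, the predicted preference for a (user, app) pair is just the dot product of the two latent vectors, so with values like the ones above every prediction ends up close to zero (a minimal sketch; predictScore is only for illustration):

// Predicted preference = dot product of the user's and the app's latent vectors
def predictScore(userVec: Array[Double], productVec: Array[Double]): Double =
  userVec.zip(productVec).map { case (u, p) => u * p }.sum

// With user values around 1e-3 and product values around 1e-4 to 1e-7 (as in the dumps above),
// each of the 10 terms is tiny, so the predictions are all nearly zero.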
My questions here are:
- Is there any other way to improve the recommendation result?
- Do my features seem right, or is something going wrong?
- Is my way of processing the original launch times (converting them to a score) right?
I put some code here. This is definitely a programming question, but it probably can't be solved with just a few lines of code.
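// Imports for the pieces shown below (project-specific helpers such as
// HbaseWriter, Utils and Constants are omitted here)
import org.apache.spark.mllib.recommendation.ALS
import org.apache.hadoop.hbase.io.ImmutableBytesWritable

// Train an implicit-feedback ALS model on the processed scores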
val model = ALS.trainImplicit(ratings, rank, iterations, lambda, alpha)
print("recommendForAllUser")
val userTopKRdd = recommendForAllUser(model, topN).join(userData.map(x => (x._2._1, x._1))).map {
case (uid, (appArray, mac)) => {
(mac, appArray.map {
case (appId, rating) => {
val packageName = appIdPriorityPackageNameDict.value.getOrElse(appId, Constants.PLACEHOLDER)
(packageName, rating)
}
})
}
}
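// Write each user's top-N list to HBase as comma-separated "packageName=rating" pairs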
HbaseWriter.writeRddToHbase(userTopKRdd, "user_top100_recommendation", (x: (String, Array[(String, Double)])) => {
val mac = x._1
val products = x._2.map {
case (packageName, rating) => packageName + "=" + rating
}.mkString(",")
val putMap = Map("apps" -> products)
(new ImmutableBytesWritable(), Utils.getHbasePutByMap(mac, putMap))
})
print("recommendSimilarApp")
println("productFeatures ******")
model.productFeatures.take(1000).map{
case (appId, features) => {
val packageNameList = appIdPackageNameListDict.value.get(appId)
val packageNameListStr = if (packageNameList.isDefined) {
packageNameList.mkString("(", ",", ")")
} else {
"Unknow List"
}
(packageNameListStr, features.mkString("(", ",", ")"))
}
}.foreach(println)
println("productFeatures ******")
model.userFeatures.take(1000).map{
case (userId, features) => {
(userId, features.mkString("(", ",", ")"))
}
}.foreach(println)
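// For every app, compute the top-N most similar apps and expand the result to
// one row per package name in the app's group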
val similarAppRdd = recommendSimilarApp(model, topN).flatMap {
case (appId, similarAppArray) => {
val groupedAppList = appIdPackageNameListDict.value.get(appId)
if (groupedAppList.isDefined) {
val similarPackageList = similarAppArray.map {
case (destAppId, rating) => (appIdPriorityPackageNameDict.value.getOrElse(destAppId, Constants.PLACEHOLDER), rating)
}
groupedAppList.get.map(packageName => {
(packageName, similarPackageList)
})
} else {
Seq.empty
}
}
}
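// Write the similar-app lists to HBase, keyed by package name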
HbaseWriter.writeRddToHbase(similarAppRdd, "similar_app_top100_recommendation", (x: (String, Array[(String, Double)])) => {
val packageName = x._1
val products = x._2.map {
case (packageName, rating) => packageName + "=" + rating
}.mkString(",")
val putMap = Map("apps" -> products)
(new ImmutableBytesWritable(), Utils.getHbasePutByMap(packageName, putMap))
})
UPDATE:
I found something new about my data after reading the paper "Collaborative Filtering for Implicit Feedback Datasets". My data is much sparser than the IPTV data set described in the paper.
Paper: 300,000 (users), 17,000 (products), 32,000,000 (observations)
Mine: 300,000 (users), 31,000 (products), 700,000 (observations)
So the user-item matrix in the paper's data set has a density of 0.00627 = 32,000,000 / (300,000 × 17,000). My data set's density is about 0.000075 = 700,000 / (300,000 × 31,000). I think this means my user-item matrix is roughly 80 times sparser than the paper's.
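The densities above come from a simple ratio (a quick sketch, nothing more):

// Density = number of observed (user, item) pairs / number of possible (user, item) pairs
def density(users: Long, items: Long, observations: Long): Double =
  observations.toDouble / (users.toDouble * items.toDouble)

val paperDensity = density(300000L, 17000L, 32000000L) // ~0.00627
val myDensity    = density(300000L, 31000L, 700000L)   // ~0.000075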
Could this sparsity lead to such a bad result? And is there any way to improve it?