I am trying to build a poor man's recommendation system for an online store. I want to implement something like Amazon's "Customers Who Bought This Item Also Bought" feature, and I have read a lot about it. I know there is Apache Mahout, but I am not able to set up the server for it. There is also the Google Prediction API, but it costs money, so I started experimenting myself.
I have an order history with 250,000+ items, and I wrote a nested MySQL query that finds the orders containing the current article, counts the other items in those orders and sorts by that count, so I get the set of products that other people ordered along with the current article.
The problem is that the query can take up to 10 seconds, so it can't be run directly on every page view. I thought about a caching table, but the query that fills it stops after 20 minutes (there are 60,000 products and 250,000 ordered items), so I am unable to fill that table.
My current workaround is the following: the recommendation HTML is loaded via AJAX on document ready, so the page loads first while the recommendations load in the background. The recommendation data is computed once and stored in a file cache (PEAR simple cache), so it loads faster the next time. The cache entry is therefore created on demand when someone visits the page, and it is kept for a day or maybe a week.
I ask myself, and you: is that an acceptable approach, or is it stupid and slow? Would it be better to store the cached data in a database or in files (I am thinking about performance and parallel hits)? In the worst case I would end up with 60,000 cache files.
I would prefer a pre-computed table with all the data, but as I said, it takes too long and I don't know how to optimize it. (Waiting until the SQL dude comes back from holidays ^^)
Thanks for any hints or opinions.
By the way, this is the query:
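To make the pre-computed idea concrete: instead of running the per-article query 60,000 times, the pair counts could in principle be produced in a single pass with one self-join. This is only a sketch and untested on this data. It assumes net_orderposition has the ID_order and ArtNr columns used in the query in this post, and that an index on (ID_order, ArtNr) exists so the join does not scan the whole table for every row.

```sql
-- Sketch: count every co-ordered pair in one pass instead of once per article.
-- Assumption: an index on net_orderposition (ID_order, ArtNr) is available.
SELECT a.ArtNr  AS parent_artnr,
       b.ArtNr  AS artnr,
       COUNT(*) AS `rank`   -- backticks because RANK is a reserved word in MySQL 8.0+
FROM net_orderposition a
JOIN net_orderposition b
  ON b.ID_order = a.ID_order
 AND b.ArtNr   <> a.ArtNr
GROUP BY a.ArtNr, b.ArtNr;
```

The result has one row per ordered pair of articles that ever appeared in the same order, so both (A, B) and (B, A) show up. Whether this finishes in acceptable time depends on the indexes and on how many items a typical order contains.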
SELECT c.ArtNr AS artnr, COUNT(c.ArtNr) AS rank, s.ArtNr AS parent_artnr
FROM (
    SELECT a.ID_order, a.ArtNr
    FROM net_orderposition a
    WHERE a.ArtNr = 'TT-PV0005'
) s
JOIN net_orderposition c
    ON s.ID_order = c.ID_order AND s.ArtNr != c.ArtNr
GROUP BY c.ArtNr
ORDER BY rank DESC, c.Stamp DESC
LIMIT 10;
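One thing worth checking for a query of this shape is whether both lookups are covered by indexes. The following is a hedged sketch, not a measured fix: the index names are made up, and equivalent indexes may already exist on the table. The query under EXPLAIN is a flattened variant of the one above (no derived table, and without the Stamp tie-break), which is an assumption about what is acceptable here.

```sql
-- Hypothetical index names; run SHOW INDEX FROM net_orderposition first,
-- since equivalent indexes may already exist.
ALTER TABLE net_orderposition
    ADD INDEX idx_artnr_order (ArtNr, ID_order),   -- finds the orders containing one article
    ADD INDEX idx_order_artnr (ID_order, ArtNr);   -- finds the other articles in those orders

-- EXPLAIN shows whether the optimizer actually uses the indexes.
EXPLAIN
SELECT c.ArtNr AS artnr, COUNT(*) AS `rank`   -- backticks: RANK is reserved in MySQL 8.0+
FROM net_orderposition s
JOIN net_orderposition c
  ON c.ID_order = s.ID_order
 AND c.ArtNr   <> s.ArtNr
WHERE s.ArtNr = 'TT-PV0005'
GROUP BY c.ArtNr
ORDER BY `rank` DESC
LIMIT 10;
```

If the EXPLAIN output still shows a full table scan on either side of the join, the 10 seconds are probably spent there rather than in the grouping or sorting.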
EDIT:
I thought about the given answers, and I think they are similar to my initial idea. The code above results in a table like the following:
ID, ParentID,  ChildID,   Rank
1,  TT-PV0005, TT-PV0040, 220
2,  TT-PV0005, TT-PV0355, 135
3,  TT-PV0005, TT-PV0450, 134
4,  TT-PV0005, TT-PV0451, 89
5,  TT-PV0005, RH-01V2,   83
6,  TT-PV0005, TT-PV0041, 83
7,  TT-PV0005, TT-PV0353, 82
8,  TT-PV0005, TT-PV0037, 80
ParentID is the current item, ChildID is an item that was ordered in the past along with ParentID, and Rank is the precomputed count of how often the child was ordered together with the current item. Now I can UPDATE or INSERT the related items on every new order and increment Rank if the pair is already present in the DB. The only thing I fear is that I will end up with a really, really big table. Maybe that wouldn't be a problem if I precalculate it offline once a week? But then I have to optimize the query so it doesn't take 10 seconds per item.
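The UPDATE-or-INSERT step described above could be sketched in MySQL with INSERT ... ON DUPLICATE KEY UPDATE. Everything named here is an assumption for illustration: the table name net_also_bought, the column types, and the order ID 123456 are placeholders, and the surrogate ID column from the table above is replaced by a composite primary key so that duplicate pairs are detected.

```sql
-- Hypothetical cache table; name and column types are assumptions.
CREATE TABLE net_also_bought (
    parent_artnr VARCHAR(32)  NOT NULL,
    child_artnr  VARCHAR(32)  NOT NULL,
    `rank`       INT UNSIGNED NOT NULL DEFAULT 0,   -- RANK is reserved in MySQL 8.0+
    PRIMARY KEY (parent_artnr, child_artnr),        -- one row per pair
    KEY idx_parent_rank (parent_artnr, `rank`)      -- supports the top-10 lookup
);

-- After a new order is stored (123456 is a placeholder order ID):
-- insert each pair from that order, or add to its count if it already exists.
INSERT INTO net_also_bought (parent_artnr, child_artnr, `rank`)
SELECT a.ArtNr, b.ArtNr, COUNT(*)
FROM net_orderposition a
JOIN net_orderposition b
  ON b.ID_order = a.ID_order
 AND b.ArtNr   <> a.ArtNr
WHERE a.ID_order = 123456
GROUP BY a.ArtNr, b.ArtNr
ON DUPLICATE KEY UPDATE `rank` = `rank` + VALUES(`rank`);

-- Page view: a cheap indexed read instead of the 10-second query.
SELECT child_artnr, `rank`
FROM net_also_bought
WHERE parent_artnr = 'TT-PV0005'
ORDER BY `rank` DESC
LIMIT 10;
```

The table only grows by pairs that were actually ordered together, not by all 60,000 x 60,000 combinations, so its size is bounded by the order data rather than by the catalogue. Note that VALUES() in this position is deprecated in recent MySQL 8.0 releases, although it still works.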
What do you think?