How to create my own recommendation engine? [closed]

Posted 2019-01-20 21:16

I have been interested in recommendation engines lately and want to improve in this area. I am currently reading "Programming Collective Intelligence" from O'Reilly, which I think is the best book on the subject. But I have no idea how to implement an engine; by "no idea" I mean I don't know where to start. I have a project like Last.fm in mind.

  1. Where do I start creating a recommendation engine? Should it be implemented on the database side or the backend side?
  2. What level of database knowledge will be needed?
  3. Are there any open source engines or other resources that can help?
  4. What should my first steps be?

5 Answers
等我变得足够好
#2 · 2019-01-20 21:53

This is a very big question, so even if I could give you a detailed answer I doubt I'd have the time. But I do have a suggestion: take a look at Greg Linden's blog and his papers on item-based collaborative filtering. Greg implemented recommendation engines at Amazon using the item-based approach. He really knows his stuff, and his blog and papers are very readable.

Blog: http://glinden.blogspot.com/
Paper: http://www.computer.org/portal/web/csdl/doi/10.1109/MIC.2003.1167344 (I'm afraid you need to log in to read it in full, but as you are a CS student this should be possible.)
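
To make the item-based idea a bit more concrete, here is a minimal sketch in Ruby. It is not the algorithm from the paper, only a simplified illustration: each item is represented by the ratings users gave it, items are compared with cosine similarity, and the items a user has not rated yet are scored by how similar they are to the items that user did rate. The RATINGS data and all method names are made up for the example.

```ruby
# Hypothetical toy data: user => { item => rating }.
RATINGS = {
  "alice" => { "song_a" => 5.0, "song_b" => 4.0 },
  "bob"   => { "song_a" => 4.0, "song_b" => 5.0, "song_c" => 1.0 },
  "carol" => { "song_b" => 1.0, "song_c" => 5.0, "song_d" => 4.0 },
  "dave"  => { "song_c" => 4.0, "song_d" => 5.0 }
}.freeze

# Invert to item => { user => rating } so items can be compared directly.
def item_vectors(ratings)
  vectors = Hash.new { |hash, item| hash[item] = {} }
  ratings.each do |user, items|
    items.each { |item, rating| vectors[item][user] = rating }
  end
  vectors
end

# Cosine similarity between two sparse vectors stored as hashes.
def cosine(a, b)
  dot = a.sum { |user, rating| rating * b.fetch(user, 0.0) }
  norm_a = Math.sqrt(a.values.sum { |r| r * r })
  norm_b = Math.sqrt(b.values.sum { |r| r * r })
  return 0.0 if norm_a.zero? || norm_b.zero?
  dot / (norm_a * norm_b)
end

# Score every item the user has not rated by summing
# similarity * rating over the items the user did rate.
def recommend(user, ratings)
  vectors = item_vectors(ratings)
  seen = ratings.fetch(user)
  scores = Hash.new(0.0)
  vectors.each_key do |candidate|
    next if seen.key?(candidate)
    seen.each do |item, rating|
      scores[candidate] += cosine(vectors[candidate], vectors[item]) * rating
    end
  end
  scores.sort_by { |_item, score| -score }
end

recommendations = recommend("alice", RATINGS)
recommendations.each { |item, score| puts format("%s %.3f", item, score) }
```

In a real system the item-to-item similarities would typically be computed offline and stored, so that serving a recommendation is just a lookup and a sort.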

Edit: You could also take a look at Infer.NET; it includes an example of building a recommender system for movie data.

beautiful°
#3 · 2019-01-20 21:59

Filmaster.com recently published an example recommendation engine that is open source (AGPLv3-licensed). It's written in C++ and uses best practices from the white papers produced as part of the Netflix challenge. An article about it can be found at: http://polishlinux.org/gnu/open-source-film-recommendation-engine/ and the code is here: http://bitbucket.org/filmaster/filmaster-test/src/tip/count_recommendations.cpp

SAY GOODBYE
#4 · 2019-01-20 22:02

I've built one for a video portal myself. The main idea was to collect data about everything:

  • Who uploaded a video?
  • Who commented on a video?
  • Which tags were created?
  • Who visited the video? (also tracking anonymous visitors)
  • Who favorited a video?
  • Who rated a video?
  • Which channels was the video assigned to?
  • Text streams of title, description, tags, channels and comments are collected by a fulltext indexer which puts weight on each of the data sources.

Next I created functions that return lists of (id, weight) tuples for each of the above points. Some only consider a limited number of videos (e.g. the last 50); some adjust the weight by, for example, rating or tag count (a tag used more often is less expressive). There are functions that return the following lists:

  • Similar videos by fulltext search
  • Videos uploaded by the same user
  • Other videos the users from these comments also commented on
  • Other videos the users from these favorites also favorited
  • Other videos the raters from these ratings also rated (weighted)
  • Other videos in the same channels
  • Other videos with the same tags (weighted by "expressiveness" of tags)
  • Other videos played by people who played this video (XY latest plays)
  • Similar videos by comments fulltext
  • Similar videos by title fulltext
  • Similar videos by description fulltext
  • Similar videos by tags fulltext

All of these are combined into a single list by simply summing the weights per video id and then sorting by weight. This works pretty well for around 1000 videos now, but you need background processing or aggressive caching for it to be fast.
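
The combining step could be sketched like this in plain Ruby. The three input lists and the method name `combine` are invented for illustration; in the real project each list would come from one of the functions listed above.

```ruby
# Each source is a list of [video_id, weight] pairs, as returned by
# the per-signal functions. The numbers here are hypothetical.
by_tags      = [[1, 0.9], [2, 0.5]]
by_uploader  = [[2, 0.3], [3, 0.3]]
by_favorites = [[1, 0.4], [3, 0.8], [4, 0.2]]

# Sum the weights per video id, then sort by total weight, highest first.
def combine(*sources)
  totals = Hash.new(0.0)
  sources.each do |pairs|
    pairs.each { |video_id, weight| totals[video_id] += weight }
  end
  totals.sort_by { |_video_id, weight| -weight }
end

ranked = combine(by_tags, by_uploader, by_favorites)
ranked.each { |video_id, weight| puts "#{video_id}: #{weight.round(2)}" }
```

Because the sources are only ever added together, a new signal can be plugged in without touching the others, and each source can be cached or recomputed in the background on its own schedule.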

I'm hoping to reduce this to a generic recommendation engine or similarity calculator soon and release it as a Rails/ActiveRecord plugin. Currently it's still a tightly integrated part of my project.

To give a small hint, in Ruby code it looks like this:

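# Returns [video_id, weight] pairs for every video that shares a tag with this one.
# A tag attached to many videos is less expressive, so its weight shrinks
# with log2 of the number of videos that carry it.
def related_by_tags
  # Rails 2-era finder syntax, kept as in the original project.
  tag_names.find(:all, :include => :videos).flat_map { |t|
    weight = TAG_WEIGHT / (0.1 + Math.log2(t.video_ids.length))
    t.video_ids.map { |v| [v, weight] }
  }
end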
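
To see what that weight formula does, here is a small standalone sketch. TAG_WEIGHT is set to 1.0 purely as an assumption, since the real value is not given, and `tag_weight` is a helper name made up for this example.

```ruby
# TAG_WEIGHT is an assumed base weight; the original value is not given.
TAG_WEIGHT = 1.0

# Weight contributed by one tag that is attached to `video_count` videos.
def tag_weight(video_count)
  TAG_WEIGHT / (0.1 + Math.log2(video_count))
end

[1, 2, 8, 1024].each do |count|
  puts format("%4d videos -> weight %.3f", count, tag_weight(count))
end
```

A tag used on a single video gets the largest weight, and the weight drops as the tag becomes more common. The 0.1 offset also keeps the divisor from being zero when a tag appears on exactly one video, since log2(1) is 0.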

I would be interested in how other people approach such problems.

爱情/是我丢掉的垃圾
#5 · 2019-01-20 22:11

Presenting recommendations can be split into two main steps:

  1. Feature extraction
  2. Recommendation

Feature extraction is very specific to the object being recommended. For music, for example, some features of the object might be the frequency response of the song, the power, the genre, etc. The features for the users might be age, location, etc. You then create a vector for each user and song with the various elements of the vector corresponding to different features of interest.

Performing the actual recommendation only requires well-thought-out feature vectors. Note that if you don't choose the right features, your recommendation engine will fail. That would be like asking you to tell me my sex based on my age. Of course my age may provide a bit of information, but I think you can imagine better questions to ask.

Anyway, once you have feature vectors for each user and song, you need to train the recommendation engine. I think the best way to do this is to get a whole bunch of users to take your demographic test and then tell you specific songs that they like. At this point you have all the information you need. Your job is to draw a decision boundary with the information you have.

Consider a simple example. You want to predict whether or not a user likes AC/DC's "Back in Black" based on age and sex. Imagine a graph showing 100 data points. The x axis is age, the y axis is sex (1 is male, 2 is female). A black mark indicates that the user likes the song, while a red mark means they don't. My guess is that this graph would have a lot of black marks for users who are male and between the ages of 12 and 37, while the rest of the marks would be red. So, if we were to manually select a decision boundary, it would be a rectangle around the area holding the majority of the black marks. This is called the decision boundary because, if a completely new person comes to you and tells you their age and sex, you only have to plot them on the graph and ask whether or not they fall within that box.
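
As a rough sketch, the hand-drawn box from this example could be written in Ruby as below. The boundary values are just the guess from the text (male, ages 12 to 37), and the constant and method names are invented for the illustration.

```ruby
# Hypothetical boundary taken from the guess in the text:
# male users (sex == 1) aged 12 to 37 are predicted to like the song.
MALE = 1
FEMALE = 2
AGE_RANGE = (12..37)

# Returns true when the point (age, sex) falls inside the box.
def likes_back_in_black?(age, sex)
  sex == MALE && AGE_RANGE.cover?(age)
end

puts likes_back_in_black?(25, MALE)    # inside the box
puts likes_back_in_black?(25, FEMALE)  # outside: wrong sex
puts likes_back_in_black?(50, MALE)    # outside: too old
```

Predicting for a new user is then a single lookup; all of the difficulty lies in choosing where the box goes.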

So, the hard part here is finding the decision boundary. The good news is that you don't need to know how to do that. You just need to know how to use some of the common tools. You can look into using neural networks, support vector machines, linear classifiers, etc. Again, don't let the big names fool you. Most people can't tell you what these things are really doing. They just know how to plug things in and get results.
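
For the curious, here is a minimal sketch of one such tool, a perceptron, which is among the simplest linear classifiers. It is only an illustration: the training data is made up and chosen to be linearly separable, the class and method names are my own, and on data that cannot be split by a straight line this loop would simply stop after max_epochs without a clean boundary.

```ruby
# Hypothetical, linearly separable toy data: [feature vector, label],
# where label +1 means "likes the song" and -1 means "does not".
TRAINING = [
  [[2.0, 2.0], 1], [[3.0, 1.5], 1], [[2.5, 3.0], 1],
  [[-2.0, -1.0], -1], [[-1.5, -2.5], -1], [[-3.0, -2.0], -1]
].freeze

# A perceptron: learns a straight-line decision boundary.
class Perceptron
  def initialize(dimensions)
    @weights = Array.new(dimensions, 0.0)
    @bias = 0.0
  end

  # Which side of the line w . x + b = 0 the point falls on.
  def predict(features)
    score = @bias + @weights.zip(features).sum { |w, x| w * x }
    score > 0 ? 1 : -1
  end

  # Nudge the boundary towards every misclassified example until a
  # full pass makes no mistakes or max_epochs is reached.
  def train(examples, max_epochs: 100)
    max_epochs.times do
      mistakes = 0
      examples.each do |features, label|
        next if predict(features) == label
        mistakes += 1
        @weights = @weights.zip(features).map { |w, x| w + label * x }
        @bias += label
      end
      break if mistakes.zero?
    end
    self
  end
end

model = Perceptron.new(2).train(TRAINING)
puts model.predict([2.2, 2.4])
puts model.predict([-2.0, -2.0])
```

In practice you would reach for an existing machine learning library rather than writing this yourself, which is exactly the point made above about plugging things in.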

I know it's a bit late, but I hope this helps anyone that stumbles on this thread.

倾城 Initia
#6 · 2019-01-20 22:17

I have a two-part blog series on implementing a collaborative-filtering-based recommendation engine on Hadoop.

http://pkghosh.wordpress.com/2010/10/19/recommendation-engine-powered-by-hadoop-part-1/

http://pkghosh.wordpress.com/2010/10/31/recommendation-engine-powered-by-hadoop-part-2/

Here is the GitHub repository for the open source project: https://github.com/pranab/sifarish

Feel free to use it if you like it.
