We have a website that contains a database of places. For each place our users are able to take one of the follow actions which we record:
- VIEW - View it's profile
- RATING - Rate it on a scale of 1-5 stars
- REVIEW - Review it
- COMPLETED - Mark that they've been there
- WISH LIST - Mark that they want to go there
- FAVORITE - Mark that it's one of their favorites
In our database table of places each place contains a count of the number of times each action above was taken as well as the average rating given by users.
- views
- ratings
- avg_rating
- completed
- wishlist
- favorite
What we want to be able to do is generate lists of the top places using the above information. Ideally, we would want to be able to generate this list using a relatively simple SQL query without needing to do any legwork to calculate additional fields or stack rank places against one another. That being said, since we only have about 50,000 places we could run a nightly cron job to calculate some fields such as rankings on different categories if it would make a meaningful difference in the overall results of our top places.
I'd appreciate if you could make some suggestions on how we should think about bubbling the best places to the top, which criteria we should weight more heavily, and given that information - suggest what the MySQL query would need to look like in order to select the top 10 places.
One thing to note is that at this time we are less concerned with the recency of a place being popular - meaning that looking at the aggregate information is fine and that more recent data doesn't need to be weighted more heavily.
Thanks in advance for your help & advice!
The exact formula for the favorite would need to come from you, we will call it f(x).
For the actual implementation I would add a popularity_score field that I would calculate, as you said from a nightly cron job using f(x) for each row.
Then it is simply a case of doing a "select place name from table order by popularity_score desc".
Ok - Let's give it a stab
popularity_score = (FAVORITE * 3 + COMPLETED * 2 + WISHLIST) * RATING * VIEW / AVG_VIEWS_OF_ALL_PROFILES
Use the lower bound of the Wilson score confidence interval for a Bernoulli parameter!
I don't have an opinion on how to weigh things.
That said, why not just add a popularity column to the location table? All of a sudden, your SQL query is incredibly simple.
The tricky part, of course, is figuring out how and when to update that value. But since you're saving all of the activity data, you can always regenerate the popularity values from the log entries.
That way, you get nice fast queries for "most popular" locations, and if you want to change the way popularity is computed, you can do so at will.
If you're clever, you might be able to devise a simple enough formula so that popularity can be tracked in real time. For instance, if we only cared about average ratings, you can modify the average rating with just three variables: the current average rating, the number of times the object has been rated, and the new rating value.
Of course, things get more complex when you start mixing in how many times the object has been viewed, reviewed, favorited, etc ... but you might find that you can devise a method that's computationally cheap enough that you can update the overall popularity value on just about every action.