I have a stores
and categories
model. A store can have many categories.
I'm trying to create a Related Stores list for each store.
I'd like to calculate a score based on the # of common categories a store shares with another.
I have the plan, but not sure how to begin coding this in Ruby on Rails.
Any advice?
PS. I think it would be best to have a separate table to store this calculated data for each store -- since doing it realtime would be intense on the DB.
UPDATE I just spotted a MAJOR flaw in my logic for this -- Just a few department stores like Amazon will dominate the related stores for all merchants (since they belong to almost all categories and thus will match every category for niche stores). Any way around this issue?
Your "MAJOR flaw" isn't uncommon. As you say, Amazon is going to be "related" to everything. This is a pretty common problem with any kind of recommendation system that tries to use such relationships. I've not done this with store categories, but the problem is very similar to the video selection/ranking system that I built.
A common way to help prevent the popular stuff from dominating is, rather than using the count of matching categories, you give weights to the scores for each store. Common weighting factors are 1/category_count
or 1/sqrt(category_count)
.
Imagine three stores:
Jim's Books - 2 categories: ["Books", "Music"]
Amazon - 10 categories: ["Books", "Music", "Movies", "Housewares", etc.]
Ralph's Remainders - 3 categories: ["Books", "Music", "Movies"]
Now, if you're looking for stores that are similar to Jim's Books, you match up the categories. Obviously, both Amazon and Ralph's include the categories "Books" and "Music", and if you were to use just the count of matched categories, both would have the same score.
But if you use a weighting factor, then their scores are much different. With a weighting factor of 1/category_count
:
Amazon - 10 categories, weighting factor = 1/10.
Ralph's - 3 categories, weighting factor = 1/3.
So Amazon would get a similarity score of 0.20, and Ralph's would get a similarity score of 0.66.
If the weighting factor is 1/sqrt(category_count)
, then:
Amazon - weighting factor = 1/sqrt(10) = 0.316
Ralph's - weighting factor = 1/sqrt(3) = 0.562
In this case, Amazon's score is about 0.632, and Ralph's score is 1.124.
I've found that the 1/sqrt(category_count)
is generally better because it dampens the overpowering effect of the highly popular stores (i.e. those that have lots of categories), but not so much that those stores don't get into the results. Using 1/category_count
gives too much emphasis to stores that have only one or two categories.
If assume that you have models:
class Store < ActiveRecord:Base
has_many :categories_stores
has_many :categories, :throught => :categories_stores
end
class CategoriesStore < ActiveRecord::Base
belongs_to :category
belongs_to :store
end
class Category < ActiveRecord::Base
has_many :categories_stores
has_many :categories, :throught => :categories_stores
end
The main algorithm in words would be:
1. Find a categories (ids) which have the selected Store.
2. Find the Stores, that have any of the categories from step 1.
3. Count the Categories for each of the found store that are from categories list 1.
This all can be done by several ways in SQL. For example:
SELECT s3.store_id, COUNT(s3.category_id) FROM categories_stores s1, categories_stores s2, categories_stores s3 WHERE s1.store_id = :id and s2.category_id = s1.category_id and s3.store_id = s2.store_id and s3.category_id = s1.category_id GROUP BY s3.store_id
Where :id - is parameter for query. Some pars of the query could be done by pure ruby, some not.