I'm using MySQL. I have a table which looks like that:
id: primary key
content: varchar
weight: int
What I want to do is randomly select one row from this table, but taking into account the weight. For example, if I have 3 rows:
id, content, weight
1, "some content", 60
2, "other content", 40
3, "something", 100
The first row has 30% chance of being selected, the second row has 20% chance of being selected, and the third row has 50% chance of being selected.
Is there a way to do that ? If I have to execute 2 or 3 queries it's not a problem.
A simple approach (avoiding joins or subqueries) is to just multiply the weight by a random number between 0 and 1 to produce a temporary weight to sort by:
To understand this, consider thatRAND() * 2x
will be a larger value thanRAND() * x
approximately two thirds of the time. Consequently, over time each row should be selected with a frequency that's proportional to its relative weight (eg. a row with weight 100 will be selected about 100 times more often than a row with weight 1, etc).Update: this method doesn't in fact produce the correct distributions, so for now don't use it! (see the comments below). I think there should still be a simple method similar to the above that will work, but for now the more complex method below, involving joins, might be better. I'm leaving this answer up because: (a) there's relevant discussion in the comments below, and (b) if/when I get a chance, I'll try to fix it.
Maybe this one:
Or this one:
This works in MSSQL and I am sure that it should be possible to change couple of keywords to make it work in MySQL as well (maybe even nicer):
The idea is to have a cumulative weight for each row (subselect-1), then find the position of the spanned RAND() in this cumulative range.
This one seems to work, but I'm not sure of the math behind it.
My guess at the reason it works is that the ascending order looks for the smallest results and by dividing by the weight for higher weights the random result is clustered more tightly near zero.
I tested it (actually the same algorithm in postgresql) with 209000 queries over 3000 rows and the weight representation came out correct.
my input data:
my results:
The
+(g.i*0)
has no effect on the arithmetic result but an external reference is required to to force the planner to re-evaluate the sub-select for each of the 209K input rows produced in ing
I don't remember how to RND() in mysql, but here working example for MSSQL:
If TOP(1) is not applicable you just fetch first record from total result set.
I have tried van's solution and, although it works, it is not quick.
My Solution
The way that I am solving this problem is by maintaining a separate, linked table for the weighting. The basic table structure is similar to this:
If I have a record in
table1
with a weight of 3, then I create 3 records intable1_weight
, linked totable1
via thetable1_id
field. Whatever the value ofweight
is intable1
, that's how many linked records I create intable1_weight
.Testing
On a dataset with 976 records in
table1
with a total weight of 2031 and hence 2031 records intable1_weight
, I ran the following two SQLs:1) A version of van's solution
2) Joining to a secondary table for the weighting
SQL 1 takes consistently 0.4 seconds.
SQL 2 takes between 0.01 and 0.02 seconds.
Conclusion
If the speed of selection of a random, weighted record is not an issue, then the single table SQL suggested by van is fine and doesn't have the overhead of maintaining a separate table.
If, as in my case, a short selection time is critical, then I would recommend the two table method.
P.S. This is my first StackOverflow post and it's taken me ages, so hopefully someone will find it helpful!