How do you randomly select a table row in T-SQL based on an applied weight for all candidate rows?
For example, I have a set of rows in a table weighted at 50, 25, and 25 (which adds up to 100 but does not need to), and I want to select one of them randomly with a statistical outcome equivalent to the respective weight.
If you need to do get a group of samples (say, you want to sample 50 rows from a collection of 5M rows) where each row has a column called
Weight
which is anint
and where larger values means more weight, you can use this function:The key here is using the POWER( ) function as illustrated here
A reference on the choice of a random function is here and here
Alternatively you can use:
You cast checksum as
BIGINT
instead ofINT
because of this issue:Dane's answer includes a self joins in a way that introduces a square law.
(n*n/2)
rows after the join where there are n rows in the table.What would be more ideal is to be able to just parse the table once.
This will go through the table, setting
@id
to each record'sid
value while at the same time decrementing@weight
point. Eventually, the@weight_point
will go negative. This means that theSUM
of all preceding weights is greater than the randomly chosen target value. This is the record we want, so from that point onwards we set@id
to itself (ignoring any IDs in the table).This runs through the table just once, but does have to run through the entire table even if the chosen value is the first record. Because the average position is half way through the table (and less if ordered by ascending weight) writing a loop could possibly be faster... (Especially if the weightings are in common groups):
You simply need to sum the weights of all candidate rows, then choose a random point within that sum, then select the record that coordinates with that chosen point (each record is incrementally carrying an accumulating weight sum with it).
The "incrementally carrying a an accumlating[sic] weight sum" part is expensive if you have a lot of records. If you also already have a wide range of scores/weights (ie: the range is wide enough that most records weights are unique. 1-5 stars probably wouldn't cut it), you can do something like this to pick a weight value. I'm using VB.Net here to demonstrate, but this could easily be done in pure Sql as well:
Run this, and pick the record with the largest score less than the returned weight. If more than one record share that score, pick it at random. The advantages here are that you don't have to maintain any sums, and you can tweak the probability equation used to suit your tastes. But again, it works best with a larger distribution of scores.
The way to do this with random number generators is to integrate the probabiliity density function. With a set of discrete values you can calculate the prefix sum (sum of all values up to this one) and store it. With this you select the minumum prefix sum (aggregate to date) value greater than the random number.
On a database the subsequent values after an insertion have to be updated. If the relative frequency of updates and size of the data set doesn't make the cost of doing this prohibitive it means that the appropriate value can be obtained in from a single s-argable (predicate that can be resolved by an index lookup) query.