Select random row from MySQL (with probability)

2019-04-07 23:07发布

问题:

I have a MySQL table that has a row called cur_odds which is a percent number with the percent probability that that row will get selected. How do I make a query that will actually select the rows in approximately that frequency when you run through 100 queries for example?

I tried the following, but a row that has a probability of 0.35 ends up getting selected around 60-70% of the time.

SELECT * FROM table ORDER BY RAND()*cur_odds DESC

All the values of cur_odds in the table add up to 1 exactly.

回答1:

If cur_odds is changed rarely you could implement the following algorithm:

1) Create another column prob_sum, for which

prob_sum[0] := cur_odds[0]

for 1 <= i <= row_count - 1:

prob_sum[i] := prob_sum[i - 1] + cur_odds[i]

2) Generate a random number from 0 to 1:

rnd := rand(0,1)

3) Find the first row for which prob_sum > rnd (if you create a BTREE index on the prob_sum, the query should work much faster):

CREATE INDEX prob_sum_ind ON <table> (prob_sum);

SET @rnd := RAND();

SELECT MIN(prob_sum) FROM <table> WHERE prob_sum > @rnd;



回答2:

Given your above SQL statement, whatever numbers you have in cur_odds are not the probabilities that each row is selected, but is instead just an arbitrary weighting (relative to the "weights" of all the other rows) which could instead be best interpreted as a relative tendency to float towards the top of the sorted table. The actual value in each row is meaningless (e.g. you could have 4 rows with values of 0.35, 0.5, 0.75 and 0.99, or you could have values of 35, 50, 75 and 99, and the results would be the same).

Update: Here's what's going on with your query. You have one row with a cur_odds value of 0.35. For the sake of illustration, I'm going to assume that the other 9 rows all have the same value (0.072). Also for the sake of illustration, let's assume RAND() returns a value from 0.0 to 1.0 (it may actually).

Every time you run this SELECT statement, each row is assigned a sorting value by multiplying its cur_odds value by a RAND() value from 0.0 to 1.0. This means that the row with a 0.35 will have a sorting value between 0.0 and 0.35.

Every other row (with a value of 0.072) will have sorting values ranging between 0.0 and 0.072. This means that there is an approximately 80% chance that your one row will have a sorting value greater than 0.072, which would mean that there is no possible chance that any other row could be sorted higher. This is why your row with the cur_odds value of 0.35 is coming up first more often than you expect.

I incorrectly described the cur_odds value as a relative change weighting. It actually functions as a maximum relative weighting, which would then involve some complex math to determine the actual relative probabilities involved.

I'm not sure what you need can be done with straight T-SQL. I've implemented a weighted probability picker many times (I was even going to ask a question about best methods for this this morning, ironically) but always in code.