I have a list of random products(1000's) each with an ID and I am bringing them up randomly. I would like that the items are not repeated. Currently what I am doing is the following:
select * from products where product_id <> previous_product_id order by rand() limit 1;
I am ensuring that the same product will not appear directly after. A repeat product usually appears a lot sooner then I would like (I believe I am right in saying this is the birthday problem). I have no idea what is the most effective way to get random data in a non-repeating fashion. I have thought of a way that I assume is highly inefficent:
I would assign the user an id (e.g. foo) and then when they have seen an item add it to a string that would be product_id_1 AND product_id_2 AND product_id_3 AND product_id_n
. I would store this data with timestamp(explained further on)
.
+--------------------------------------------------------------------------------------------+
|user_id |timestamp | product_seen_string |
|--------------------------------------------------------------------------------------------|
|foo |01-01-14 12:00:00 |product_id_1 AND product_id_2 AND product_id_3 AND product_id_n |
+--------------------------------------------------------------------------------------------+
With this product_seen_string
I would keep adding to the seen products (I would also update the timestamp) and then in the query I would do a first query based on the user_id
to obtain this string and then add that returned string to the main query that obtains the random products like so:
select * from products where product_id <> product_id_1 AND product_id_2 AND product_id_3 AND product_id_n order by rand() limit 1;
I would also write into that if no products were returned then the data would be deleted so that the cycle can start again. As well as having a cron job that every ten minutes would run to see if the timestamp is older then an hour I would delete it.
The scripting language I am using is PHP
Selecting random rows is always tricky, and there are no perfect solutions that don't involve some compromise. Either compromise performance, or compromise even random distribution, or compromise on the chance of selecting duplicates, etc.
As @JakeGould mentions, any solution with ORDER BY RAND()
does not scale well. As the number of rows in your table gets larger, the cost of sorting the whole table in a filesort gets worse and worse. Jake is correct that the query cannot be cached when the sort order is random. But I don't care about that so much because I usually disable the query cache anyway (it has its own scalability problems).
Here's a solution to pre-randomize the rows in the table, by creating a rownum column and assigning unique consecutive values:
ALTER TABLE products ADD COLUMN rownum INT UNSIGNED, ADD KEY (rownum);
SET @rownum := 0;
UPDATE products SET rownum = (@rownum:=@rownum+1) ORDER BY RAND();
Now you can get a random row by an index lookup, without sorting:
SELECT * FROM products WHERE rownum = 1;
Or you can get the next random row:
SELECT * FROM products WHERE rownum = 2;
Or you can get 10 random rows at a time, or any other number you want, with no duplicates:
SELECT * FROM products WHERE rownum BETWEEN 11 and 20;
You can re-randomize anytime you want:
SET @rownum := 0;
UPDATE products SET rownum = (@rownum:=@rownum+1) ORDER BY RAND();
It's still costly to do the random sorting, but now you don't have to do it on every SELECT query. You can do it on a schedule, hopefully at off-peak times.
You might consider a pagination-like solution. Rather than ordering by RAND()
(not good form a performance standpoint anyway), why not simply use LIMIT
clause and randomize the offset.
For example:
SELECT product_id FROM products ORDER BY product_id LIMIT X, 1
Where X
is the offset you want to use. You could easily keep track of the offsets used in the application and randomize amongst the available remaining values.
PHP code to call this might look like this:
if(!isset($_SESSION['available_offsets'] || count($_SESSION['available_offsets']) === 0) {
$record_count = ...; // total number of records likely derived from query against table in question
// this creates array with numerical keys matching available offsets
// we don't care about the values
$available_offsets = array_fill(0, $record_count, '');
} else {
$available_offsets = $_SESSION['available_offsets'];
}
// pick offset from available offsets
$offset = array_rand($available_offsets);
// unset this as available offset
unset($available_offsets[$offset]);
// set remaining offsets to session for next page load
$_SESSION['available_offsets'] = $available_offsets;
$query = 'SELECT product_id FROM products ORDER BY product_id LIMIT ' . $offset . ', 1';
// make your query now
You could try adding a seed to the RAND to avoid repeats
select * from products where product_id <> previous_product_id order by rand(7) limit 1;
From http://www.techonthenet.com/mysql/functions/rand.php :
The syntax for the RAND function in MySQL is:
RAND( [seed] )
Parameters or Arguments
seed
Optional. If specified, it will produce a repeatable sequence of random numbers each time that seed value is provided.
First, unclear what scripting language you will use to piece this together, so will answer conceptually. I also added uppercase to your MySQL for readability. So first let’s look at this query:
SELECT * FROM products WHERE product_id <> previous_product_id ORDER BY RAND() LIMIT 1;
Superficially does what you want it to do, but if your database has thousands of items, RAND()
is not recommended. It defeats all MySQL caching & is a resource hog. More details here, specially the area that reads:
A query cannot be cached if it contains any of the functions shown in
the following table.
That’s just not good.
But that said you might be able to improve it by just returning the product_id
:
SELECT product_id FROM products WHERE product_id <> previous_product_id ORDER BY RAND() LIMIT 1;
And then you would have to do another query for the actual product data based on the product_id
, but that would be much less taxing on the server than grabbing a whole set of randomized data.
But still, RAND()
inherently will bog down your system due to lack of caching. And the problem will only get worse as your database grows.
Instead I would recommend having some kind of file-based solution. That grabs a random list of items like this:
SELECT product_id FROM products WHERE product_id <> previous_product_id ORDER BY RAND() LIMIT 0,100;
You will strictly be grabbing product ids and then saving them to—let’s say—a JSON file. The logic being is as random as you want this to be, you have to be realistic. So grabbing a slice of 100 items at a time will give a user a nice selection of items.
Then you would load that file in as an array and perhaps even randomize the array & grab one item off of the top; a quick way to assure that the item won’t be chosen again. I you wish you could shorten the array with each access by re-saving the JSON file. And then when it gets down to—let’s say—less than 10 items, fetch a new batch & start again.
But in general you RAND()
is a neat trick that is useful for small datasets. Anything past that should be re-factored into code where you realistically leverage what you want users to see versus what can realistically be done in a scalable manner.
EDIT: Just another thing I noticed in your code on keeping track of product IDs. If you want to go down that route—instead of saving items to a JSON file—that would be fine as well. But what you are describing is not efficient:
+--------------------------------------------------------------------------------------------+
|user_id |timestamp | product_seen_string |
|--------------------------------------------------------------------------------------------|
|foo |01-01-14 12:00:00 |product_id_1 AND product_id_2 AND product_id_3 AND product_id_n |
+--------------------------------------------------------------------------------------------+
A better structure would be to simply store the IDs in a MySQL table like this; let’s call it products_seen
:
+----------------------------------------------------+
|user_id |timestamp | product_seen_id |
|----------------------------------------------------|
|foo |01-01-14 12:00:00 |product_id_1 |
|foo |01-01-14 12:00:00 |product_id_2 |
|foo |01-01-14 12:00:00 |product_id_3 |
+----------------------------------------------------+
And then have a query like this
SELECT * FROM products, products_seen WHERE user_id = products_seen.user_id product_id <> product_seen_id ORDER BY RAND() LIMIT 1;
Basically, you would be cross referencing products_seen
with products
and making sure the products already seen by user_id
are not shown again.
This is all pseudo-code, so please adjust to fit the concept.