I have a table with data which I have to randomize. By randomizing, I mean use data from random row to update another row in that same column. Problem is that the table itself is big (more than 2 000 000 rows).
I wrote a piece of code which uses while loop, but it's going slow.
Does anyone have any suggestion about more efficient way of achieving randomization?
Based on Mitch Wheats answer linking to this article on scrambline data you can do something like this to scramble a bunch of fields, you're not just limited to the IDs:
So you aren't just limited to doing this as the article shows:
The danger to this approach is the amount of data you are tweaking. Using a CTE crams all that stuff into memory, so while I found that this is fairly speedy ( 19 seconds for a 500k row table ). You will want to be careful if you have a table that has millions of records. You should consider how much data is actually needed, or is a good population sample, for testing and development.
I combined the answers that I find above in a single query that randomizes each column again ending up with completely randomized records
This query took 8 seconds on a 10K rows table shuffling 8 columns on a Windows 2008 R2 machine with 16GB of memory with 4 XEON cores @ 2.93GHz
To shuffle data in 10 columns so that the 10 values per row are replaced with other values from other rows will be expensive.
You have to read 2 million rows 10 times.
The SELECT would be
I also wouldn't update, I'd create a new table
Then add keys etc, drop the old table, rename it. Or use a SYNONYM to point to the new table
If you want to update, then I'd do it like this. or break it up into 10 updates .
In order to update rows, there will be significant processsing time (CPU + I/O) from the updates.
Have you measured the relative expense of randomising the rows versus performing the updates?
In all you need to do is pick random rows, here's an efficient method to pick a random sample of rows (in this case 1% of the rows)
where
pkID
is your primary key column.This post might be of interest: