Efficiently randomize (shuffle) data in Sql Server

2019-02-25 02:50发布

问题:

I have a table with data which I have to randomize. By randomizing, I mean use data from random row to update another row in that same column. Problem is that the table itself is big (more than 2 000 000 rows).

I wrote a piece of code which uses while loop, but it's going slow.

Does anyone have any suggestion about more efficient way of achieving randomization?

回答1:

In order to update rows, there will be significant processsing time (CPU + I/O) from the updates.

Have you measured the relative expense of randomising the rows versus performing the updates?

In all you need to do is pick random rows, here's an efficient method to pick a random sample of rows (in this case 1% of the rows)

SELECT * FROM myTable
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), pkID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

where pkID is your primary key column.

This post might be of interest:

  • Randomising data


回答2:

To shuffle data in 10 columns so that the 10 values per row are replaced with other values from other rows will be expensive.

You have to read 2 million rows 10 times.

The SELECT would be

SELECT
    FirstName, LastName, VIN, ...
FROM
    (SELECT FirstName FROM MyTable ORDER BY NEWID()) FirstName
    JOIN
    (SELECT LastName FROM MyTable ORDER BY NEWID()) LastName ON 1=1
    JOIN
    (SELECT VIN FROM MyTable ORDER BY NEWID()) VIN ON 1=1
    JOIN
    ...

I also wouldn't update, I'd create a new table

SELECT
    FirstName, LastName, VIN, ...
INTO
    StagingTable
FROM
    (SELECT FirstName FROM MyTable ORDER BY NEWID()) FirstName
    JOIN
    (SELECT LastName FROM MyTable ORDER BY NEWID()) LastName ON 1=1
    JOIN
    (SELECT VIN FROM MyTable ORDER BY NEWID()) VIN ON 1=1
    JOIN
    ...

Then add keys etc, drop the old table, rename it. Or use a SYNONYM to point to the new table

If you want to update, then I'd do it like this. or break it up into 10 updates .

UPDATE
   M
SET
   Firstname = FirstName.FirstName,
   LastName = LastName.LastName,
   ...
FROM
    MyTable M
    JOIN 
    (SELECT FirstName FROM MyTable ORDER BY NEWID()) FirstName ON 1=1
    JOIN
    (SELECT LastName FROM MyTable ORDER BY NEWID()) LastName ON 1=1
    JOIN
    (SELECT VIN FROM MyTable ORDER BY NEWID()) VIN ON 1=1
    JOIN
    ...


回答3:

Based on Mitch Wheats answer linking to this article on scrambline data you can do something like this to scramble a bunch of fields, you're not just limited to the IDs:

;WITH Randomize AS 
( 
SELECT ROW_NUMBER() OVER (ORDER BY [UserID]) AS orig_rownum, 
      ROW_NUMBER() OVER (ORDER BY NewId()) AS new_rownum, 
      * 
FROM [UserTable]
) 
UPDATE T1 
   SET [UserID] = T2.[UserID]
      ,[FirstName] = T2.[FirstName]
      ,[LastName] = T2.[LastName]
      ,[AddressLine1] =  T2.[AddressLine1]
      ,[AddressLine2] =  T2.[AddressLine2]
      ,[AddressLine3] =  T2.[AddressLine3]
      ,[City] = T2.[City]
      ,[State] = T2.[State]
      ,[Pincode] = T2.[Pincode]
      ,[PhoneNumber] = T2.[PhoneNumber]
      ,[MobileNumber] = T2.[MobileNumber]
      ,[Email] = T2.[Email]
      ,[Status] = T2.[Status] 
FROM Randomize T1 
      join Randomize T2 on T1.orig_rownum = T2.new_rownum 
;

So you aren't just limited to doing this as the article shows:

;WITH Randomize AS 
( 
SELECT ROW_NUMBER() OVER (ORDER BY Id) AS orig_rownum, 
      ROW_NUMBER() OVER (ORDER BY NewId()) AS new_rownum, 
      * 
FROM [MyTable]
) 
UPDATE T1 SET Id = T2.Id 
FROM Randomize T1 
      join Randomize T2 on T1.orig_rownum = T2.new_rownum 
;

The danger to this approach is the amount of data you are tweaking. Using a CTE crams all that stuff into memory, so while I found that this is fairly speedy ( 19 seconds for a 500k row table ). You will want to be careful if you have a table that has millions of records. You should consider how much data is actually needed, or is a good population sample, for testing and development.



回答4:

I combined the answers that I find above in a single query that randomizes each column again ending up with completely randomized records

UPDATE MyTable SET
  columnA = columnA.newValue,
  columnB = columnB.newValue,
  -- Optionally, for maintaining a group of values like street, zip, city in an address
  columnC = columnGroup.columnC,
  columnD = columnGroup.columnD,
  columnE = columnGroup.columnE
FROM MyTable
INNER JOIN (
  SELECT ROW_NUMBER() OVER (ORDER BY id) AS rn, id FROM MyTable
) AS PKrows ON MyTable.id = PKrows.id
-- repeat the following JOIN for each column you want to randomize
INNER JOIN (
  SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnA AS newValue FROM MyTable
) AS columnA ON PKrows.rn = columnA.rn
INNER JOIN (
  SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnB AS newValue FROM MyTable
) AS columnB ON PKrows.rn = columnB.rn

-- Optionally, if you want to maintain a group of values spread out over several columns
INNER JOIN (
  SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnC, columnD, columnE FROM MyTable
) AS columnGroup ON PKrows.rn = columnGroup.rn

This query took 8 seconds on a 10K rows table shuffling 8 columns on a Windows 2008 R2 machine with 16GB of memory with 4 XEON cores @ 2.93GHz