Populate random data from another table

Posted 2019-02-23 09:47

Question:

update dataset1.test
   set column4 = (select column1 
                 from dataset2
                 order by random()
                 limit 1
                 ) 

I need to update column4 of dataset1 so that each row gets a random entry from column1 of dataset2. With the query above, however, every row of dataset1 receives the same random value, whereas I want each row to get its own random value.

Answer 1:

SETUP

Let's start by assuming your tables and data are the following ones. Note that I assume that dataset1 has a primary key (it can be a composite one, but, for the sake of simplicity, let's make it an integer):

CREATE TABLE dataset1
(
     id INTEGER PRIMARY KEY,
     column4 TEXT
) ;

CREATE TABLE dataset2
(
    column1 TEXT
) ;

We fill both tables with sample data:

INSERT INTO dataset1
    (id, column4)
SELECT
    i, 'column 4 for id ' || i
FROM
    generate_series(101, 120) AS s(i);

INSERT INTO dataset2
    (column1)
SELECT
    'SOMETHING ' || i
FROM 
    generate_series (1001, 1020) AS s(i) ;

Sanity check:

SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
|    20 |

Case 1: number of rows in dataset1 <= rows in dataset2

We'll perform a complete shuffle. Each value from dataset2 will be used at most once (exactly once when both tables have the same number of rows).

EXPLANATION

In order to make an update that shuffles all the values from column4 in a random fashion, we need some intermediate steps.

First, for dataset1, we need to create a list (relation) of tuples (id, rn), which are just:

(id_1,   1),
(id_2,   2),
(id_3,   3),
...
(id_20, 20)

Here id_1, ..., id_20 are the ids present in dataset1. They can be of any type, they need not be consecutive, and they can be composite.

For dataset2, we need to create another list of tuples (column1, rn), which looks like:

(column1_1,  17),
(column1_2,   3),
(column1_3,  11),
...
(column1_20, 15)

In this case, the second column contains all the values 1 .. 20, but shuffled.

Once we have the two relations, we JOIN them ON ... rn. This, in practice, produces yet another list of tuples with (id, column1), where the pairing has been done randomly. We use these pairs to update dataset1.
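
Before updating anything, the pairing described above can be previewed with a plain SELECT. This is only a sketch against the sample tables from the SETUP section; the aliases original_keys and shuffled_data are illustrative names, and the output differs on every run because of random().

```sql
-- Preview only: nothing is updated here.
SELECT
    original_keys.id,
    shuffled_data.column1
FROM
    -- (id, rn) with rn = 1 .. number of rows in dataset1
    (SELECT id, row_number() OVER () AS rn
     FROM dataset1) AS original_keys
    JOIN
    -- (column1, rn) with rn = 1 .. number of rows in dataset2, in random order
    (SELECT column1, row_number() OVER (ORDER BY random()) AS rn
     FROM dataset2) AS shuffled_data
    ON shuffled_data.rn = original_keys.rn
ORDER BY
    original_keys.id ;
```

Each id shows up once, paired with one value of column1, and no value of column1 is paired with two different ids.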

THE REAL QUERY

This can all be done (clearly, I hope) by using CTEs (WITH clauses) to hold the intermediate relations:

WITH original_keys AS
(
    -- This creates tuples (id, rn), 
    -- where rn increases from 1 to the number of rows
    SELECT 
        id, 
        row_number() OVER  () AS rn
    FROM 
        dataset1
)
, shuffled_data AS
(
    -- This creates tuples (column1, rn)
    -- where rn moves between 1 and number of rows, but is randomly shuffled
    SELECT 
        column1,
        -- The next statement is what *shuffles* all the data
        row_number() OVER  (ORDER BY random()) AS rn
    FROM 
        dataset2
)
-- You update your dataset1
-- with the shuffled data, linking back to the original keys
UPDATE
    dataset1
SET
    column4 = shuffled_data.column1
FROM
    shuffled_data
    JOIN original_keys ON original_keys.rn = shuffled_data.rn
WHERE
    dataset1.id = original_keys.id ;

Note that the trick is performed by means of:

row_number() OVER (ORDER BY random()) AS rn

The row_number() window function produces as many consecutive numbers as there are rows, starting from 1. These numbers are randomly shuffled because the OVER clause takes all the data and sorts it randomly.
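
To see the effect in isolation, here is a small sketch that does not touch the sample tables. It numbers five generated rows in random order; the column names i and rn are chosen only for illustration.

```sql
-- Every value 1 .. 5 appears exactly once in rn,
-- but which i receives which rn changes on every run.
SELECT
    i,
    row_number() OVER (ORDER BY random()) AS rn
FROM
    generate_series(1, 5) AS s(i)
ORDER BY
    i ;
```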

CHECKS

We can check again:

SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
|    20 |

SELECT * FROM dataset1 ;
 id | column4       
--: | :-------------
101 | SOMETHING 1016
102 | SOMETHING 1009
103 | SOMETHING 1003
...
118 | SOMETHING 1012
119 | SOMETHING 1017
120 | SOMETHING 1011
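
Two further checks may be useful. They are a sketch that assumes the sample data above, where all values of dataset2.column1 are distinct. With that data, the first query should return no rows and the second should return 0.

```sql
-- Values that ended up in more than one row of dataset1
SELECT
    column4,
    count(*) AS times_used
FROM
    dataset1
GROUP BY
    column4
HAVING
    count(*) > 1 ;

-- Rows of dataset1 whose column4 does not come from dataset2
SELECT
    count(*) AS not_from_dataset2
FROM
    dataset1
WHERE
    NOT EXISTS (SELECT 1
                FROM dataset2
                WHERE dataset2.column1 = dataset1.column4) ;
```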

ALTERNATIVE

Note that this can also be done with subqueries instead of CTEs, by simple substitution. That might improve performance on some occasions:

UPDATE
    dataset1
SET
    column4 = shuffled_data.column1
FROM
    (SELECT 
        column1,
        row_number() OVER  (ORDER BY random()) AS rn
    FROM 
        dataset2
    ) AS shuffled_data
    JOIN 
    (SELECT 
        id, 
        row_number() OVER  () AS rn
    FROM 
        dataset1
    ) AS original_keys ON original_keys.rn = shuffled_data.rn
WHERE
    dataset1.id = original_keys.id ;

And again...

SELECT * FROM dataset1;
 id | column4       
--: | :-------------
101 | SOMETHING 1011
102 | SOMETHING 1018
103 | SOMETHING 1007
...
118 | SOMETHING 1020
119 | SOMETHING 1002
120 | SOMETHING 1016

NOTE: if you do this with very large datasets, don't expect it to be extremely fast. Shuffling a very big deck of cards is expensive.
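
If you want to measure the cost on your own data instead of guessing, one option is EXPLAIN (ANALYZE). Keep in mind that EXPLAIN (ANALYZE) really executes the statement, so this sketch wraps it in a transaction that is rolled back afterwards; it uses the subquery form of the update against the sample tables.

```sql
BEGIN ;

-- Runs the update and reports the actual plan and timings
EXPLAIN (ANALYZE)
UPDATE
    dataset1
SET
    column4 = shuffled_data.column1
FROM
    (SELECT column1, row_number() OVER (ORDER BY random()) AS rn
     FROM dataset2) AS shuffled_data
    JOIN
    (SELECT id, row_number() OVER () AS rn
     FROM dataset1) AS original_keys
    ON original_keys.rn = shuffled_data.rn
WHERE
    dataset1.id = original_keys.id ;

-- Undo the changes made while measuring
ROLLBACK ;
```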


Case 2: number of rows in dataset1 > rows in dataset2

In this case, values for column4 can be repeated several times.

The easiest possibility I can think of (probably not an efficient one, but easy to understand) is to create a function random_column1, marked as VOLATILE:

CREATE FUNCTION random_column1() 
    RETURNS TEXT
    VOLATILE      -- important!
    LANGUAGE SQL
AS
$$
    SELECT
        column1
    FROM
        dataset2
    ORDER BY
        random()
    LIMIT
        1 ;
$$ ;

And use it to update:

UPDATE
    dataset1
SET
    column4 = random_column1();

This way, some values from dataset2 might not be used at all, whereas others will be used more than once.
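
Since the function sorts dataset2 once per updated row, a set-based variant may be worth trying on larger tables. The following is only a sketch of a different technique, not the approach above: it numbers the rows of dataset2, draws a random row number for every row of dataset1, and joins the two. The CTE names numbered and picks are made up for this example.

```sql
WITH numbered AS
(
    -- (column1, rn) with rn = 1 .. number of rows in dataset2
    SELECT
        column1,
        row_number() OVER () AS rn
    FROM
        dataset2
)
, picks AS
(
    -- One random rn per row of dataset1.
    -- random() is in [0, 1), so rn is in 1 .. number of rows in dataset2.
    SELECT
        id,
        1 + floor(random() * (SELECT count(*) FROM dataset2))::bigint AS rn
    FROM
        dataset1
)
UPDATE
    dataset1
SET
    column4 = numbered.column1
FROM
    picks
    JOIN numbered ON numbered.rn = picks.rn
WHERE
    dataset1.id = picks.id ;
```

Whichever variant you use, you can see how often each value was picked with:

```sql
SELECT
    column4,
    count(*) AS times_used
FROM
    dataset1
GROUP BY
    column4
ORDER BY
    times_used DESC, column4 ;
```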

Answer 2:

A better option is to reference the outer table from within the subquery. The subquery then becomes correlated and has to be evaluated for every row:

update dataset1.test
   set column4 = (select
        case when dataset1.test.column4 = dataset1.test.column4
             then column1 end
        from dataset2
        order by random()
        limit 1
   )
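
The CASE expression is only there to mention a column of the outer table; its condition is true for every non-NULL value. One caveat: if column4 is NULL, the comparison yields NULL and the CASE returns NULL instead of a value from dataset2. A minimal sketch of the same idea, written against the sample tables from Answer 1 (dataset1 with a primary key id, which is an assumption taken from that answer and not part of this one), correlates on the NOT NULL key instead. How the planner treats such a subquery may vary between PostgreSQL versions, so verify the result on your own system.

```sql
UPDATE
    dataset1
SET
    column4 = (SELECT
                   -- id is a primary key, so this condition is always true;
                   -- it exists only to reference the outer row
                   CASE WHEN dataset1.id = dataset1.id
                        THEN column1 END
               FROM
                   dataset2
               ORDER BY
                   random()
               LIMIT
                   1
              ) ;

-- With more than one row in dataset2, this should usually be greater than 1
SELECT count(DISTINCT column4) FROM dataset1 ;
```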