Parallel unnest() and sort order in PostgreSQL

Posted 2019-01-09 13:18

Question:

I understand that when using

SELECT unnest(ARRAY[5,3,9]) as id

without an ORDER BY clause, the order of the result set is not guaranteed. I could for example get:

id
--
3
5
9

But what about the following request:

SELECT
  unnest(ARRAY[5,3,9]) as id,
  unnest(ARRAY(select generate_series(1, array_length(ARRAY[5,3,9], 1)))) as idx
ORDER BY idx ASC

Is it guaranteed that the 2 unnest() calls (which have the same length) will unroll in parallel and that the index idx will indeed match the position of the item in the array?

I am using PostgreSQL 9.3.3.

Answer 1:

Yes, that is a feature of Postgres and parallel unnesting is guaranteed to be in sync (as long as all arrays have the same number of elements).
Postgres 9.4 adds a clean solution for parallel unnest:

  • Unnest multiple arrays in parallel
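As a minimal sketch of that 9.4 feature: unnest() accepts several arrays when it is called in the FROM clause and returns one column per array, pairing elements by position. The alias t and the column names id and idx are only illustrative, and the second array is a hand-written list of positions used to make the pairing visible.

```sql
-- Requires PostgreSQL 9.4 or later; the multi-array form of unnest()
-- is only allowed in the FROM clause.
SELECT t.id, t.idx
FROM   unnest(ARRAY[5,3,9], ARRAY[1,2,3]) AS t(id, idx)
ORDER  BY t.idx;
```

If the arrays differ in length, this form pads the shorter ones with NULL values.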

The order of resulting rows is not guaranteed, though. Actually, with a statement as simple as:

SELECT unnest(ARRAY[5,3,9]) AS id

the rows do come back in array order in practice, but Postgres does not formally guarantee it. The query optimizer is free to order rows as it sees fit as long as the order is not explicitly defined. This may have side effects in more complex queries.

If the second query in your question is what you actually want (add an index number to unnested array elements), there is a better way with generate_subscripts():

SELECT unnest(ARRAY[5,3,9]) AS id
     , generate_subscripts(ARRAY[5,3,9], 1) AS idx
ORDER  BY idx;

Details in this related answer:

  • How to access array internal index with postgreSQL?

You will be interested in WITH ORDINALITY in Postgres 9.4:

  • PostgreSQL unnest() with element number

Then you can use:

SELECT * FROM unnest(ARRAY[5,3,9]) WITH ORDINALITY tbl(id, idx);


Answer 2:

Short answer: No, idx will not match the array positions, when accepting the premise that unnest() output may be randomly ordered.

Demo: since the current implementation of unnest actually outputs the rows in the order of the elements, I suggest adding a layer on top of it to simulate a random order:

CREATE FUNCTION unnest_random(anyarray)  RETURNS setof anyelement
language sql as
$$ select unnest($1) order by random() $$;

Then check out a few executions of your query with unnest replaced by unnest_random:

SELECT
  unnest_random(ARRAY[5,3,9]) as id,
  unnest_random(ARRAY(select generate_series(1, array_length(ARRAY[5,3,9], 1)))) as idx
ORDER BY idx ASC

Example of output:

 id | idx 
----+-----
  3 |   1
  9 |   2
  5 |   3

id=3 is associated with idx=1 but 3 was in 2nd position in the array. It's all wrong.

What's wrong with the query: it assumes that the first unnest will shuffle the elements using the same permutation as the second unnest (permutation in the mathematical sense: the relationship between the order in the array and the order of the rows). But this assumption contradicts the premise that the output order of unnest is unpredictable to start with.

About this question:

Is it guaranteed that the 2 unnest() calls (which have the same length) will unroll in parallel

In select unnest(...) X1, unnest(...) X2, with X1 and X2 being of type SETOF something and having the same number of rows, X1 and X2 are paired in the final output so that the X1 value at row N sits next to the X2 value at the same row N (a kind of UNION for columns, as opposed to a Cartesian product).

But I wouldn't describe this pairing as unroll in parallel, so I'm not sure this is what you meant.

Anyway, this pairing doesn't help with the problem, since it happens after the unnest calls have lost the array positions.
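The row-by-row pairing described above can be observed with any two set-returning functions of equal length in the select list. A small sketch, with generate_series() standing in for the two unnest() calls (the column names x1 and x2 are illustrative):

```sql
-- Two set-returning functions of the same length in the select list:
-- the result has 3 rows paired row by row, not 3 x 3 = 9 rows.
SELECT generate_series(1, 3)   AS x1,
       generate_series(11, 13) AS x2;
```

When the sets differ in length, the behaviour depends on the server version: before PostgreSQL 10 the result has as many rows as the least common multiple of the lengths, while from PostgreSQL 10 on the shorter set is padded with NULL values.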

An alternative: In this thread from the pgsql-sql mailing list, this function is suggested:

CREATE OR REPLACE FUNCTION unnest_with_ordinality(anyarray, OUT value anyelement,
                                                  OUT ordinality integer)
  RETURNS SETOF record AS
$$
SELECT $1[i], i FROM
    generate_series(array_lower($1,1),
                    array_upper($1,1)) i;
$$
LANGUAGE sql IMMUTABLE; 

Based on this, we can order by the second output column:

select * from unnest_with_ordinality(array[5,3,9]) order by 2;
 value | ordinality 
-------+------------
     5 |          1
     3 |          2
     9 |          3

With Postgres 9.4 and above: the WITH ORDINALITY clause, which can follow set-returning function calls in the FROM clause, provides this functionality in a generic way.
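To illustrate the "generic" part, here is a sketch that attaches WITH ORDINALITY to a set-returning function other than unnest(). The input string, the alias t and the column names part and pos are made up for the example.

```sql
-- Requires PostgreSQL 9.4 or later.
-- pos numbers the rows in the order the function emits them.
SELECT t.part, t.pos
FROM   regexp_split_to_table('b,a,c', ',') WITH ORDINALITY AS t(part, pos)
ORDER  BY t.pos;
```

The ordinality column is produced together with each row, so ordering by it does not rely on the unspecified order of the result set.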