Why do these join differently based on size?

2020-04-11 18:06发布

问题:

In Postgresql, if you unnest two arrays of the same size, they line up each value from one array with one from the other, but if the two arrays are not the same size, it joins each value from one with every value from the other.

select unnest(ARRAY[1, 2, 3, 4, 5]::bigint[]) as id,
unnest(ARRAY['a', 'b', 'c', 'd', 'e']) as value

Will return

1 | "a"
2 | "b"
3 | "c"
4 | "d"
5 | "e"

But

select unnest(ARRAY[1, 2, 3, 4, 5]::bigint[]) as id, -- 5 elements
unnest(ARRAY['a', 'b', 'c', 'd']) as value -- 4 elements
order by id

Will return

1 | "a"
1 | "b"
1 | "c"
1 | "d"
2 | "b"
2 | "a"
2 | "c"
2 | "d"
3 | "b"
3 | "d"
3 | "a"
3 | "c"
4 | "d"
4 | "a"
4 | "c"
4 | "b"
5 | "d"
5 | "c"
5 | "b"
5 | "a"

Why is this? I assume some sort of implicit rule is being used, and I'd like to know if I can do it explicitly (eg if I want the second style when I have matching array sizes, or if I want missing values in one array to be treated as NULL).

回答1:

Support for set-returning functions in SELECT is a PostgreSQL extension, and an IMO very weird one. It's broadly considered deprecated and best avoided where possible.

Avoid using SRF-in-SELECT where possible

Now that LATERAL is supported in 9.3, one of the two main uses is gone. It used to be necessary to use a set-returning function in SELECT if you wanted to use the output of one SRF as the input to another; that is no longer needed with LATERAL.

The other use will be replaced in 9.4, when WITH ORDINALITY is added, allowing you to preserve the output ordering of a set-returning function. That's currently the main remaining use: to do things like zip the output of two SRFs into a rowset of matched value pairs. WITH ORDINALITY is most anticipated for unnest, but works with any other SRF.

Why the weird output?

The logic that PostgreSQL is using here (for whatever IMO insane reason it was originally introduced in ancient history) is: whenever either function produces output, emit a row. If only one function has produced output, scan the other one's output again to get the rows required. If neither produces output, stop emitting rows.

It's easier to see with generate_series.

regress=> SELECT generate_series(1,2), generate_series(1,2);
 generate_series | generate_series 
-----------------+-----------------
               1 |               1
               2 |               2
(2 rows)

regress=> SELECT generate_series(1,2), generate_series(1,3);
 generate_series | generate_series 
-----------------+-----------------
               1 |               1
               2 |               2
               1 |               3
               2 |               1
               1 |               2
               2 |               3
(6 rows)

regress=> SELECT generate_series(1,2), generate_series(1,4);
 generate_series | generate_series 
-----------------+-----------------
               1 |               1
               2 |               2
               1 |               3
               2 |               4
(4 rows)

In the majority of cases what you really want is a simple cross join of the two, which is a lot saner.

regress=> SELECT a, b FROM generate_series(1,2) a, generate_series(1,2) b;
 a | b 
---+---
 1 | 1
 1 | 2
 2 | 1
 2 | 2
(4 rows)

regress=> SELECT a, b FROM generate_series(1,2) a, generate_series(1,3) b;
 a | b 
---+---
 1 | 1
 1 | 2
 1 | 3
 2 | 1
 2 | 2
 2 | 3
(6 rows)

regress=> SELECT a, b FROM generate_series(1,2) a, generate_series(1,4) b;
 a | b 
---+---
 1 | 1
 1 | 2
 1 | 3
 1 | 4
 2 | 1
 2 | 2
 2 | 3
 2 | 4
(8 rows)

The main exception is currently for when you want to run multiple functions in lock-step, pairwise (like a zip), which you cannot currently do with joins.

WITH ORDINALITY

This will be improved in 9.4 with WITH ORDINALITY, a d while it'll be a bit less efficient than a multiple SRF scan in SELECT (unless optimizer improvements are added) it'll be a lot saner.

Say you wanted to pair up 1..3 and 10..40 with nulls for excess elements. Using with ordinality that'd be (PostgreSQL 9.4 only):

regress=# SELECT aval, bval 
           FROM generate_series(1,3) WITH ORDINALITY a(aval,apos) 
           RIGHT OUTER JOIN generate_series(1,4) WITH ORDINALITY b(bval, bpos) 
           ON (apos=bpos);

 aval | bval 
------+------
    1 |    1
    2 |    2
    3 |    3
      |    4
(4 rows)

wheras the srf-in-from would instead return:

regress=# SELECT generate_series(1,3) aval, generate_series(1,4) bval;
 aval | bval 
------+------
    1 |    1
    2 |    2
    3 |    3
    1 |    4
    2 |    1
    3 |    2
    1 |    3
    2 |    4
    3 |    1
    1 |    2
    2 |    3
    3 |    4
(12 rows)