I want to "left join" a table so that a value is joined not just to a matching row, but also to any subsequent non-matching rows, up to the next matching row. To put it another way, I want to fill in nulls with the previous non-null value.
Sample data and desired result:
Table x
:
id
----
1
2
3
4
5
Table y
:
id | val
----+-----
1 | a
4 | b
Result of select x.id, y.val from x left join y on x.id=y.id order by x.id;
:
id | val
----+-----
1 | a
2 |
3 |
4 | b
5 |
Desired result:
id | val
----+-----
1 | a
2 | a
3 | a
4 | b
5 | b
Indices
Create indices on x.id
and y.id
- which you probably already have if those are your primary keys.
A multi-column index may help, too, especially with index only scans in pg 9.2+:
CREATE INDEX y_mult_idx ON y (id DESC, val)
However, in my tests, this index was not used at first. Had to add (otherwise pointless) val
to ORDER BY
to convince the query planner that the sort order matches. See query 3.
The index makes little difference in this synthetic setup. But for tables with more columns, retrieving val
from the table becomes increasingly expensive, making the "covering" index more attractive.
Queries
1) Simple
SELECT DISTINCT ON (x.id)
x.id, y.val
FROM x
JOIN y ON y.id <= x.id
ORDER BY x.id, y.id DESC;
SQL Fiddle.
More explanation for the technique with DISTINCT
in this related answer:
- Select first row in each GROUP BY group?
I ran some tests because I had my suspicions that the first query wouldn't scale well. It's fast with a small table, but no good with bigger tables. Postgres doesn't optimize the plan and starts with a (limited) cross join, with a cost of O(N²)
.
2) Fast
This query is still rather simple and scales excellently:
SELECT x.id, y.val
FROM x
JOIN (SELECT *, lead(id, 1, 2147483647) OVER (ORDER BY id) AS next_id FROM y) y
ON x.id >= y.id
AND x.id < y.next_id
ORDER BY 1;
The window function lead()
is instrumental. I make use of the option to provide a default to cover the corner case of the last row: 2147483647
is the biggest possible integer. Adapt to your data type.
3) Very simple and almost as fast
SELECT x.id
,(SELECT val FROM y WHERE id <= x.id ORDER BY id DESC, val LIMIT 1) AS val
FROM x;
Normally, correlated subqueries tend to be slow. But this one can just pick a value from the (covering) index and is otherwise so simple that it can compete.
The additional ORDER BY
item val
(bold emphasis) seems pointless. But adding it convinces the query planner that it's ok to use the multi-column index y_mult_idx
from above, because the sort order matches. Note the
Index Only Scan using y_mult_idx ..
in the EXPLAIN
output.
Test case
After a lively debate and multiple updates I collected all queries posted so far and made a test case for a quick overview. I only use 1000 rows so SQLfiddle does not time out with the slower queries. But the top 4 (Erwin 2, Clodoaldo, a_horse, Erwin 3) scale linearly in all my local tests.
Updated once more to include my latest addition, improve format and order by performance now:
Big SQL Fiddle comparing performance.
SQL Fiddle
select
id,
first_value(val) over(partition by g order by id) val
from (
select
x.id, val, count(val) over(order by x.id) g
from
x
left join
y on x.id=y.id
) s
order by id
select id,
first_value(t.val) over (partition by group_flag order by t.id) as val
from (
select x.id,
y.val,
sum(case
when y.val is null then 0
else 1
end) over (order by x.id) as group_flag
from x
left join y on x.id=y.id
) t
order by id;
SQLFiddle example: http://sqlfiddle.com/#!12/38903/1
SELECT
m.id,
y.val
FROM (
SELECT
x.id,
MAX(y.id) id_y
FROM
x INNER JOIN y ON x.id >= y.id
GROUP BY
x.id
) m INNER JOIN y ON m.id_y = y.id
ORDER BY
m.id
Please see fiddle here.
I like to express the aggregate functions MIN(), MAX(), or closer_to()
in terms of (NOT) EXISTS.
SELECT x.id, y.val
FROM x
JOIN y ON y.id <= x.id
WHERE NOT EXISTS (SELECT *
FROM y y2
WHERE y2.id <= x.id -- same condition AS main query
AND y2.id > y.id -- but closer to x.id
)
;
My gut feeling is that this will generate exactly the same query plan as Erwin's answer.
Use COALESCE and Subquery for the logic.
The subquery will retrieve the last val value.
Try this:
SELECT x1.id,
coalesce(y1.val, (SELECT val
FROM y
WHERE id = (SELECT Max(x2.id)
FROM x AS x2
JOIN y AS y2
ON x2.id = y2.id
WHERE x2.id < x1.id)))
FROM x AS x1
LEFT JOIN y AS y1
ON x1.id = y1.id
ORDER BY x1.id;
sqlfiddle : http://www.sqlfiddle.com/#!12/42526/1
I am not sure how you would achieve this using single stored procedure. Logic similar to following would return you the desired result
create PROCEDURE GetData
AS
BEGIN
Declare @resultTable TABLE(
id int,
value varchar(10))
Declare @id int
Declare @previousValue varchar(10)
Declare @currentValue varchar(10)
DECLARE x_cursor CURSOR
FOR SELECT id FROM x order by id
OPEN x_cursor
FETCH NEXT FROM x_cursor into @id;
WHILE (@@FETCH_STATUS = 0)
BEGIN
select @currentValue = isnull(val,@previousValue) from Y where id = @id
insert into @resultTable values(@id,@currentValue)
set @previousValue = @currentValue
FETCH NEXT FROM x_cursor into @id;
END
Close x_cursor
Deallocate x_cursor
select * from @resultTable
END
GO