可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I have the following problem: I am trying to discover all possible paths from source node (node_s) to target node (node_t).
The format of the original table with graph edges is simple: | node_x | node_y | strength | , where "node_x" -> "node_y" is a direct edge with strength of the edge being "weight".
The idea is if at any point during the exploration of the paths we discover that a node among its children has target node_t, we record this path and stop exploring paths from this node, otherwise continue exploration.
The simple solution was to use PostgreSQL's recursive CTE which constructs the transitive closure of the graph:
WITH RECURSIVE transitive_closure (source, target, weight, path_string) AS
(
--non-recurive term
SELECT b.node_x, b.node_y, b.strength, b.node_x || '.' || b.node_y || '.' AS path_string
FROM best_path b
WHERE b.node_x = node_s --source_node
UNION
--recurive term
SELECT tc.source, b.node_y, least(tc.weight,b.strength), tc.path_string || b.node_y || '.' AS path_string
FROM best_path AS b JOIN transitive_closure AS tc ON b.node_x = tc.target
WHERE tc.path_string NOT LIKE '%' || b.node_y || '.%'
)
SELECT *
FROM transitive_closure tc
WHERE tc.target = node_t --target_node
The code above will discover all possible paths from source node node_s. Only after transitive closure construction, I am able to select the needed rows of paths from source node to target node (see last SELECT statement).
Example:
best_path table has the following data:
node_x | node_y | strength
--------+--------+----------
1 | 2 | 1
1 | 3 | 1
2 | 4 | 1
2 | 5 | 1
3 | 6 | 1
3 | 7 | 1
4 | 8 | 1
4 | 9 | 1
5 | 10 | 1
5 | 11 | 1
query:
find the paths from source node = 1, to target node = 4
result:
source | target | strength | path_string
--------+--------+----------+------------
1 | 2 | 1 | 1.2.
1 | 3 | 1 | 1.3.
1 | 4 | 1 | 1.2.4.
1 | 5 | 1 | 1.2.5.
1 | 6 | 1 | 1.3.6.
1 | 7 | 1 | 1.3.7.
1 | 8 | 1 | 1.2.4.8.
1 | 9 | 1 | 1.2.4.9.
1 | 10 | 1 | 1.2.5.10.
1 | 11 | 1 | 1.2.5.11.
This is not what I need. Since there is a direct edge from node 2 to node 4 (target) already, I do not need paths 1.2.5., 1.2.4.8., 1.2.4.9.,1.2.5.10.,1.2.5.11., the path exploration for node 2 should stop at the point when the path from 2 to 4 was discovered.
To sum up, I do not want to discover the paths of the node if it already has a direct edge to the target node. This means that in a recursive term of CTE, I would like to have some condition that will say the following, pseudocode follows:
WITH RECURSIVE transitive_closure (source, target, weight, path_string) AS
(
--non-recurive term (as before)
SELECT b.node_x, b.node_y, b.strength, b.node_x || '.' || b.node_y || '.' AS path_string
FROM best_path b
WHERE b.node_x = node_s --source_node
UNION
--recurive term
SELECT tc.source, b.node_y, least(tc.weight,b.strength), tc.path_string || b.node_y || '.' AS path_string
FROM best_path AS b JOIN transitive_closure AS tc ON b.node_x = tc.target
WHERE tc.path_string NOT LIKE '%' || b.node_y || '.%'
AND b.node_y = node_t
--will select only rows with direct edge to target
UNION (second union is not allowed in CTE)
SELECT those rows which do not have direct edge to target
AND which parents did not contribute to constructing the query above.
i.e. b.node_x = tc.target where there is no direct edge between b.node_x to node_t
)
As a result of a query to find the paths from source node = 1, to target node = 4, I would like to have the following:
source | target | strength | path_string
--------+--------+----------+------------
1 | 2 | 1 | 1.2.
1 | 3 | 1 | 1.3.
1 | 4 | 1 | 1.2.4.
1 | 6 | 1 | 1.3.6.
1 | 7 | 1 | 1.3.7.
Thanks in advance for your help!
I have tried many ways already: e.g. conditions in FROM/WHERE clauses, trying to pass CTE into function, but no success.
Any suggestions will be appreciated.
I have my own recursive function that achieves what I want, however, it is very slow on enormous amount of data; and PostgreSQL's CTE apparently is well-optimized, so I would like to dig into it a bit more.
回答1:
You can make the searching of the path more efficient if you start at the bottom. Start at the children. If you start at the parent, it entails traversing all the children; whereas if you searched from the child, it has only one parent, hence won't waste time finding the path between source and target.
with recursive find_parent(source, target, recentness) as
(
select source, target, 0
from tbl
where target = 9
union all
select i.source, i.target, fp.recentness + 1
from tbl i
join find_parent fp on i.target = fp.source
),
construct_path(source, target, recentness, path) as
(
select source, target, recentness, source || '.' || target
from find_parent
where recentness = (select max(recentness) from find_parent)
union
select dd.source, dd.target, dd.recentness, cp.path || '.' || dd.target
from find_parent dd
join construct_path cp on dd.recentness = cp.recentness - 1
)
select source, target, path
from construct_path
order by recentness desc
Output:
SOURCE TARGET PATH
1 2 1.2
2 4 1.2.4
4 9 1.2.4.9
Live test: http://www.sqlfiddle.com/#!1/13e6b/1
Similar to this: How to get the parent given a child in SQL SERVER 2005
This is optimized, cut recursion to parent if it already find the specific one(source).
Source = 2
Target = 9
with recursive find_parent(source, target, recentness) as
(
select source, target, 0
from tbl
where target = 9
union all
select i.source, i.target, fp.recentness + 1
from tbl i
join find_parent fp on i.target = fp.source
-- despite the name, this target is another one's source
and i.target <> 2
)
,construct_path(source, target, recentness, path) as
(
select source, target, recentness, source || '.' || target
from find_parent
where recentness = (select max(recentness) from find_parent)
union
select dd.source, dd.target, dd.recentness, cp.path || '.' || dd.target
from find_parent dd
join construct_path cp on dd.recentness = cp.recentness - 1
)
select source, target, path
from construct_path
order by recentness desc
Output:
SOURCE TARGET PATH
2 4 2.4
4 9 2.4.9
Live test: http://www.sqlfiddle.com/#!1/13e6b/16
回答2:
Upon re-reading the OP's question, I come up with this solution:
source
with recursive -- this denotes that at least one CTE is using recursion
inputs as ( select 1 as d_source, 4 as d_target )
,traverse_path(filter, source, target, path, bingo) as
(
select bp.node_x, bp.node_x, bp.node_y, bp.node_x || '.' || bp.node_y,
max((bp.node_y = i.d_target)::int) over(partition by bp.node_x) ::bool
from best_path bp cross join inputs i
where bp.node_x = i.d_source -- filter
union
select tp.filter, bp.node_x, bp.node_y, path || '.' || node_y,
max((bp.node_y = i.d_target)::int) over(partition by bp.node_x) ::bool
from best_path bp cross join inputs i
join traverse_path tp on bp.node_x = tp.target
and not tp.bingo
)
select tp.*
from traverse_path tp cross join inputs i
-- if Postgresql has Oracle KEEP windowing,
-- we don't need to use WHERE clause here
where
not tp.bingo
or
(
tp.bingo and tp.target = d_target
)
The above WHERE clause could be shortened to:
WHERE not bingo or target = 4
Output:
FILTER SOURCE TARGET PATH BINGO
1 1 2 1.2 0
1 1 3 1.3 0
1 2 4 1.2.4 1
1 3 6 1.3.6 0
1 3 7 1.3.7 0
Live test: http://www.sqlfiddle.com/#!1/cdde6/55
Here's the result for Source = 2, Target = 5 :
FILTER SOURCE TARGET PATH BINGO
2 2 5 2.5 1
Data:
CREATE TABLE best_path
(node_x int, node_y int, strength int);
INSERT INTO best_path
(node_x, node_y, strength)
VALUES
(1, 2, 1),
(1, 3, 1),
(2, 4, 1),
(2, 5, 1),
(3, 6, 1),
(3, 7, 1),
(4, 8, 1),
(4, 9, 1),
(5, 10, 1),
(5, 11, 1);
回答3:
Temporary table for testing:
CREATE TEMP TABLE best_path (node_x int, node_y int, strength int);
INSERT INTO best_path VALUES
(1, 2, 1)
,(1, 3, 1)
,(2, 4, 1)
,(2, 5, 1)
,(3, 6, 1)
,(3, 7, 1)
,(4, 8, 1)
,(4, 9, 1)
,(5, 10, 1)
,(5, 11, 1);
Query:
modified to accomodate the comment about 2 - 5
WITH RECURSIVE t AS ( -- for readability and convenience:
SELECT 1 AS node_s -- source_node
, 4 AS node_t -- target_node
)
, x AS (
SELECT node_x
FROM t, best_path
WHERE node_y = node_t
)
, a AS (
SELECT b.node_x
, b.node_y
, b.strength AS weight
, b.node_x || '.' || b.node_y || '.' AS path
FROM t, best_path b
LEFT JOIN x ON x.node_x = b.node_x
WHERE b.node_x = node_s
AND (x.node_x IS NULL -- no point with edge to target
OR b.node_y = node_t) -- except with target_node
UNION ALL
SELECT a.node_x
, b.node_y
, least(a.weight, b.strength)
, a.path || b.node_y || '.' AS path
FROM t, a
JOIN best_path AS b ON b.node_x = a.node_y
LEFT JOIN x ON x.node_x = b.node_x
WHERE a.node_y <> node_t -- arrived at target
AND a.path !~~ ('%' || b.node_y || '.%') -- not visited yet
AND (x.node_x IS NULL -- no point with edge to target
OR b.node_y = node_t) -- except with target_node
)
TABLE a;
Produces the requested result exactly.
I added the break condition to the initial query, too, because we can reach the target in only one step.
This happens to be much like my answer to a previous, similar question. All the explanation and links apply. The major additional trick here is to include an additional CTE x
to collect the nodes that ...
have (a) direct edge to target.
For repeated use in the break condition of the recursive CTE. It is not widely known that you can add additional (non-recursive) CTEs on top of a recursive CTE regardless of the keyword RECURSIVE
.
回答4:
Optimized solution, no more WHERE clause on final result; albeit Postgresql-specific solution, i.e. we uses DISTINCT ON in order to pick specific row :
Given this data:
CREATE TABLE best_path
(node_x int, node_y int, strength int);
INSERT INTO best_path
(node_x, node_y, strength)
VALUES
(1, 2, 1),
(1, 3, 1),
(2, 4, 1),
(2, 5, 1),
(3, 6, 1),
(3, 7, 1),
(4, 8, 1),
(4, 9, 1),
(5, 10, 1),
(5, 11, 1),
(5, 12, 1);
Query, first stage, shows behind the scenes (source = 1, target = 11) :
with recursive -- this denotes that at least one CTE is using recursion
inputs as ( select 1 as d_source, 11 as d_target )
,traverse_path(filter, source, target, path, bingo, distincter) as
(
select
bp.node_x, bp.node_x, bp.node_y, bp.node_x || '.' || bp.node_y
,max((bp.node_y = i.d_target)::int) over(partition by bp.node_x) ::bool
,coalesce(
nullif(
max((bp.node_y = i.d_target)::int)
over(partition by bp.node_x order by bp.node_x, bp.node_y = d_target desc, bp.node_y)
, 0)
,bp.node_y)
from best_path bp cross join inputs i
where bp.node_x = i.d_source -- filter
union
select
tp.filter, bp.node_x, bp.node_y, path || '.' || node_y
,max((bp.node_y = i.d_target)::int) over(partition by bp.node_x) ::bool
,coalesce(
nullif(
max((bp.node_y = i.d_target)::int)
over(partition by bp.node_x order by bp.node_x, bp.node_y = d_target desc, bp.node_y)
, 0)
,bp.node_y)
from best_path bp cross join inputs i
join traverse_path tp on bp.node_x = tp.target
and not tp.bingo
)
select tp.*
from traverse_path tp
Output for source = 1, target = 11 : http://www.sqlfiddle.com/#!1/db290/56
FILTER SOURCE TARGET PATH BINGO DISTINCTER
1 1 2 1.2 0 2
1 1 3 1.3 0 3
1 2 4 1.2.4 0 4
1 2 5 1.2.5 0 5
1 3 6 1.3.6 0 6
1 3 7 1.3.7 0 7
1 4 8 1.2.4.8 0 8
1 4 9 1.2.4.9 0 9
1 5 11 1.2.5.11 1 1
1 5 10 1.2.5.10 1 1
1 5 12 1.2.5.12 1 1
As we can see, the target 11, is the first row among the source of 5. How this did happen? On our DISTINCTER column, we uses ORDER BY on MAX, this is one of the rare occasions that an ORDER on MAX windowing makes sense. We just use it in order to sort our result. I tried placing the ORDER BY at the end of the query, but the database complain that it cannot use ORDER on CTE. CTE forbids placing an ORDER BY clause. But as we all know, we can influence the sorting on OVER()
clause, so that's how our results get sorted. In the result above, among the source of 5, the number 11 sorts before 10 and 12, as 11 is our target. So when we do a DISTINCT ON
(Postgresql-specific feature), Postgres will pickup that first line, hence it will pickup the target 11.
This is our final query, optimized, albeit Postgresql-specific:
with recursive -- this denotes that at least one CTE is using recursion
inputs as ( select 1 as d_source, 11 as d_target )
,traverse_path(filter, source, target, path, bingo) as (
select
distinct on (
bp.node_x,
coalesce(
nullif(
max((bp.node_y = i.d_target)::int)
over(partition by bp.node_x order by bp.node_x, bp.node_y = d_target desc, bp.node_y)
, 0)
,bp.node_y)
)
bp.node_x, bp.node_x, bp.node_y, bp.node_x || '.' || bp.node_y
,max((bp.node_y = i.d_target)::int) over(partition by bp.node_x) ::bool
from best_path bp cross join inputs i
where bp.node_x = i.d_source -- filter
union
select
distinct on (
bp.node_x,
coalesce(
nullif(
max((bp.node_y = i.d_target)::int)
over(partition by bp.node_x order by bp.node_x, bp.node_y = d_target desc, bp.node_y)
, 0)
,bp.node_y)
)
tp.filter, bp.node_x, bp.node_y, path || '.' || node_y
,max((bp.node_y = i.d_target)::int) over(partition by bp.node_x) ::bool
from best_path bp cross join inputs i
join traverse_path tp on bp.node_x = tp.target
and not tp.bingo
) select tp.* from traverse_path tp
Output for source = 1, target = 11 http://www.sqlfiddle.com/#!1/db290/55
FILTER SOURCE TARGET PATH BINGO
1 1 2 1.2 0
1 1 3 1.3 0
1 2 4 1.2.4 0
1 2 5 1.2.5 0
1 3 6 1.3.6 0
1 3 7 1.3.7 0
1 4 8 1.2.4.8 0
1 4 9 1.2.4.9 0
1 5 11 1.2.5.11 1
Output for source = 1, target = 4 : http://www.sqlfiddle.com/#!1/db290/53
FILTER SOURCE TARGET PATH BINGO
1 1 2 1.2 0
1 1 3 1.3 0
1 2 4 1.2.4 1
1 3 6 1.3.6 0
1 3 7 1.3.7 0
Output for source = 2, target = 5 : http://www.sqlfiddle.com/#!1/db290/54
FILTER SOURCE TARGET PATH BINGO
2 2 5 2.5 1
Another approach, instead of BINGO approach. Use continue_search as the logic for continuing the travesal. And use EVERY aggregate function to determine if we need to continue traversing the graph.
Here's how it works: http://www.sqlfiddle.com/#!1/db290/84
Final query: http://www.sqlfiddle.com/#!1/db290/85
It's interesting to note that EVERY is very English-like as it gets:
Contrast this:
,max((bp.node_y = i.d_target)::int) over(partition by bp.node_x) ::bool
With using EVERY :
,every(bp.node_y <> i.d_target) over(partition by bp.node_x)
Which one is easier to read?
Here's the output that illustrates the principle of using EVERY to facilitate DISTINCT ON:
Source = 1, Target = 5. Note that 5 sorts first before other numbers that belongs to the same source, this will later be picked on by DISTINCT ON.
FILTER SOURCE TARGET PATH CONTINUE_SEARCH DISTINCTER
1 1 2 1.2 1 2
1 1 3 1.3 1 3
1 2 5 1.2.5 0 0
1 2 4 1.2.4 0 0
1 3 6 1.3.6 1 6
1 3 7 1.3.7 1 7
Here's the query that does that principle:
with recursive -- this denotes that at least one CTE is using recursion
inputs as ( select 1 as d_source, 5 as d_target )
,traverse_path(filter, source, target, path, continue_search, distincter) as
(
select
bp.node_x, bp.node_x, bp.node_y, concat(bp.node_x , '.' , bp.node_y )
,every(bp.node_y <> i.d_target) over(partition by bp.node_x)
,coalesce(
cast(nullif(
every(bp.node_y <> i.d_target)
over(partition by bp.node_x order by bp.node_x, bp.node_y = d_target desc, bp.node_y)
, true) as int)
,bp.node_y)
from best_path bp cross join inputs i
where bp.node_x = i.d_source -- filter
union
select
tp.filter, bp.node_x, bp.node_y, concat(path , '.' , node_y)
,every(bp.node_y <> i.d_target) over(partition by bp.node_x)
,coalesce(
cast(nullif(
every(bp.node_y <> i.d_target)
over(partition by bp.node_x order by bp.node_x, bp.node_y = d_target desc, bp.node_y)
, true) as int)
,bp.node_y)
from best_path bp cross join inputs i
join traverse_path tp on bp.node_x = tp.target
and tp.continue_search
)
select tp.*
from traverse_path tp
Final query:
with recursive -- this denotes that at least one CTE is using recursion
inputs as ( select 1 as d_source, 5 as d_target )
,traverse_path(filter, source, target, path, continue_search) as
(
select
distinct on
(
bp.node_x
,coalesce(
cast(nullif(
every(bp.node_y <> i.d_target)
over(partition by bp.node_x order by bp.node_x, bp.node_y = d_target desc, bp.node_y)
, true) as int)
,bp.node_y)
)
bp.node_x, bp.node_x, bp.node_y, concat(bp.node_x , '.' , bp.node_y )
,every(bp.node_y <> i.d_target) over(partition by bp.node_x)
from best_path bp cross join inputs i
where bp.node_x = i.d_source -- filter
union
select
distinct on
(
bp.node_x
,coalesce(
cast(nullif(
every(bp.node_y <> i.d_target)
over(partition by bp.node_x order by bp.node_x, bp.node_y = d_target desc, bp.node_y)
, true) as int)
,bp.node_y)
)
tp.filter, bp.node_x, bp.node_y, concat(path , '.' , node_y)
,every(bp.node_y <> i.d_target) over(partition by bp.node_x)
from best_path bp cross join inputs i
join traverse_path tp on bp.node_x = tp.target
and tp.continue_search
)
select tp.*
from traverse_path tp
Output:
FILTER SOURCE TARGET PATH CONTINUE_SEARCH
1 1 2 1.2 1
1 1 3 1.3 1
1 2 5 1.2.5 0
1 3 6 1.3.6 1
1 3 7 1.3.7 1