I have a set of data that has been created by matching together similar sub-items, and then GROUPing these similar items by "category".
Now, the resultant categories must be matched in such a way that groups related categories together within each "group_id". In the example below, one match is A->B->C->D->E->F->G, which is obtained by recursing through rows.
I've posted my current answer, which works on this simple data set, but because the actual data set contains up to 1M rows, and there may be up to 60 categories per "group_id," this query causes an "out of spool space" error on real data.
Please note:
- Due to company restrictions, I cannot use stored procedures.
- I can't use user defined functions (UDFs)
- I can't use user defined types (UDTs)
A correct answer will
- Be written for Teradata or compatible
- Be more efficient than my answer
- Respect the restrictions I mention above
Sample Input:
Desired Output:
You need a recursive apporach, but your WITH RECURSIVE
creates a humongous intermediate result, which leads to no more spool.
For a similar process I used following approach (originally USING A WHILE-loop in a Stored Procedure):
CREATE MULTISET VOLATILE TABLE vt_tmp, NO Log AS
(
SELECT group_id, category_1, category_2,
-- assign a unique number to
Dense_Rank() Over (ORDER BY group_id, category_1) AS rnk
-- remove when you source data is unique
GROUP BY 1,2,3 -- same result as a DISTINCT, but processed before DENSE_RANK
FROM match_detail
)
WITH DATA
PRIMARY INDEX (category_2)
ON COMMIT PRESERVE ROWS;
Now repeat the following update until 0 rows processed
:
-- find matching categories and assign them a common number
UPDATE vt_tmp FROM
( SELECT e2.group_id, e2.category_1, Min(e1.rnk) AS minrnk
FROM vt_tmp e1 JOIN vt_tmp e2
ON e1.category_2 = e2.category_2
AND e1.rnk < e2.rnk
GROUP BY e2.group_id, e2.category_1
) x
SET rnk = minrnk
WHERE
vt_tmp.group_id = x.group_id
AND vt_tmp.category_1 = x.category_1
;
To get the related categories you finally need:
SELECT group_id, category_1 AS category, rnk AS related_categories
FROM vt_tmp
UNION
SELECT group_id, category_2, rnk
FROM vt_tmp
For an exact match of your expected result you need to add a DENSE_RANK
:
SELECT group_id, category, Dense_Rank() Over (PARTITION BY group_id ORDER BY related_categories)
FROM
(
SELECT group_id, category_1 AS category, rnk AS related_categories
FROM vt_tmp
UNION
SELECT group_id, category_2, rnk
FROM vt_tmp
) AS dt
The concept of this solution is to avoid cycles while traversing through the edges.
It is done by computing the path's bitmap on the run and avoiding adding an edge which its category_2 is already in the bitmap.
All path starts with a self-referencing edge (e.g. 'A'--'A') in order to take the first node bitmap.
This solution lowers the number of result records of the iterative sub-query (edges) to 70.
The limitation of this solution is that the bitmap size should be defined in advance.
Every digit in the varbinary literal stands for 4 bits.
64 bit = '0000000000000000'xb
128 bit = '00000000000000000000000000000000'xb
etc.
with category_group_bitmap (group_id,category_1,bitflag_group)
as
(
select group_id
,category_1
,dense_rank () over
(
partition by group_id
order by sum (distinct category_2_n)
) as bitflag_group
from edges
group by group_id
,category_1
)
,recursive edges (n,group_id,category_1,category_2,categories_num,bitmap,category_2_n)
as
(
select 1 as n
,m.group_id
,m.category_1
,m.category_2
,gc.categories_num
,setbit ('0000000000000000'xb,m.category_2_n) as bitmap
,m.category_2_n
from match_detail_category_2_n as m
join group_categories_num as gc
on gc.group_id =
m.group_id
where m.category_1 =
m.category_2
union all
select e.n + 1 as n
,e.group_id
,e.category_1
,m.category_2
,e.categories_num
,setbit (e.bitmap,m.category_2_n) as bitmap
,m.category_2_n
from edges as e
join match_detail_category_2_n as m
on m.group_id =
e.group_id
and m.category_1 =
e.category_2
where e.n < e.categories_num - 1
and getbit (e.bitmap,m.category_2_n) = 0
)
,match_detail_category_2_n (group_id,category_1,category_2,category_2_n)
as
(
select m.group_id
,category_1
,category_2
,cast
(
dense_rank () over
(
partition by group_id
order by category_2
) - 1
as byteint
)
from match_detail as m
)
,group_categories_num (group_id,categories_num)
as
(
select group_id
,count (distinct category_1)
from match_detail
group by group_id
)
select *
from category_group_bitmap
;
This works, but causes an "out of spool space" issue on real data.
Schema creation:
CREATE VOLATILE TABLE match_detail (
group_id bigint
, category_1 varchar(255)
, category_2 varchar(255)
) PRIMARY INDEX (group_id)
ON COMMIT PRESERVE ROWS;
INSERT INTO match_detail VALUES (1,'A','B');
INSERT INTO match_detail VALUES (1,'A','A');
INSERT INTO match_detail VALUES (1,'B','A');
INSERT INTO match_detail VALUES (1,'B','C');
INSERT INTO match_detail VALUES (1,'B','B');
INSERT INTO match_detail VALUES (1,'C','B');
INSERT INTO match_detail VALUES (1,'C','D');
INSERT INTO match_detail VALUES (1,'C','C');
INSERT INTO match_detail VALUES (1,'D','C');
INSERT INTO match_detail VALUES (1,'D','E');
INSERT INTO match_detail VALUES (1,'D','D');
INSERT INTO match_detail VALUES (1,'E','D');
INSERT INTO match_detail VALUES (1,'E','F');
INSERT INTO match_detail VALUES (1,'E','E');
INSERT INTO match_detail VALUES (1,'F','E');
INSERT INTO match_detail VALUES (1,'F','G');
INSERT INTO match_detail VALUES (1,'F','F');
INSERT INTO match_detail VALUES (1,'G','F');
INSERT INTO match_detail VALUES (1,'G','G');
INSERT INTO match_detail VALUES (1,'W','X');
INSERT INTO match_detail VALUES (1,'W','W');
INSERT INTO match_detail VALUES (1,'W','Y');
INSERT INTO match_detail VALUES (1,'W','Z');
INSERT INTO match_detail VALUES (1,'X','W');
INSERT INTO match_detail VALUES (1,'X','X');
INSERT INTO match_detail VALUES (1,'Y','W');
INSERT INTO match_detail VALUES (1,'Y','Y');
INSERT INTO match_detail VALUES (1,'Z','W');
INSERT INTO match_detail VALUES (1,'Z','Z');
INSERT INTO match_detail VALUES (2,'L','L');
INSERT INTO match_detail VALUES (2,'M','N');
INSERT INTO match_detail VALUES (2,'N','M');
INSERT INTO match_detail VALUES (2,'M','M');
INSERT INTO match_detail VALUES (2,'N','N');
Query:
WITH
related_cats AS (
SELECT
group_id
, category_1
, SUM(DISTINCT bitflag) As bitflag_total
, DENSE_RANK() OVER (
PARTITION BY
group_id
ORDER BY
group_id
, bitflag_total
) As bitflag_group
FROM bitflags
GROUP BY 1, 2
)
, bitflags As (
SELECT
DISTINCT
group_id
, category_1
, category_2
, CAST
(
2 ** (DENSE_RANK() OVER (
PARTITION BY
group_id
ORDER BY group_id
, category_2) - 1)
As bigint
) As bitflag
FROM cat_join
WHERE depth = 1
)
, RECURSIVE cat_join AS (
SELECT DISTINCT
c1.group_id
, c1.category_1
, c1.category_2
, CAST
(
n.num_categories - 1
As integer
) As max_depth
, CASE
WHEN c1.category_1 = c1.category_2 THEN 1
ELSE max_depth
END As depth
, 1 As recursion
FROM matches c1
INNER JOIN num_categories n
ON c1.group_id = n.group_id
UNION ALL
SELECT
r1.group_id
, r1.category_1
, r2.category_2
, r1.max_depth
, CASE
WHEN r1.category_1 = r1.category_2 THEN 1
WHEN r1.category_1 = r2.category_2 THEN 1
ELSE r1.depth - 1
END As cur_depth
, recursion + 1
FROM cat_join r1
INNER JOIN matches r2
ON r1.group_id = r2.group_id
AND r1.category_2 = r2.category_1
WHERE
r1.category_1 <> r2.category_2
AND r1.category_1 <> r2.category_1
AND
(r1.depth - 1) > 0
)
, matches AS (
SELECT
d.group_id
, d.category_1
, d.category_2
FROM match_detail d
GROUP BY d.group_id, d.category_1, d.category_2
)
, num_categories AS (
SELECT
u.group_id
, COUNT(DISTINCT u.category_2) AS num_categories
FROM categories u
GROUP BY u.group_id
)
, categories AS (
SELECT DISTINCT
u.group_id
, u.category_1
, u.category_2
FROM match_detail u
)
SELECT *
FROM related_cats