- This challenge is based on a real life use-case involving IP ranges.
- The solution I came with is based on the stack trace challenge I've presented previously. Each range start is treated as a PUSH operation and each range end + 1 is treated as a POP operation.
The Challenge
We have a data set of ranges where each range has a starting point, ending point and a value.
create table ranges
(
range_start int not null
,range_end int not null
,range_val char(1) not null
)
;
A range can contain another range or follow another range, but cannot be equal to another range or intersect with another range.
These are valid relations between ranges:
(1) (2) (3) (4)
--------- --------- --------- -------- -----------
--- --- ---
These relations are not valid:
(5) (6)
------- --------
------- --------
Our initial ranges, when presented graphically, might look something like this (The letter represents range_val):
AAAAAAAA BBCCCCCCC
DDE F GGGGG
H IIII
J
The goal is to take the initial set of ranges and create a new set under the following rule:
A contained range will override the corresponding sub-range of the the containing range.
The requested result, when presented graphically, might look something like this
ADDHAAAF BIIJIGCCC
Requirements
- The solution should be a single SQL query (sub-queries are fine).
- The use of T-SQL, PL/SQL etc. is not allowed.
- The use of UDF (User Defined Functions) is not allowed.
Data Sample
AAAAAAAAAAAAAAAAAAAAAAAAAAAA BBBB CCCCCCCCCCCCCCCCCCCCCCCCC
DDDE FFFFFFFF GGGGGGGGG HHHHHHHH IIIIIII
JJ KKKLLL MM NN OOOOO
P QQ
insert into ranges (range_start,range_end,range_val) values (1 ,28 ,'A');
insert into ranges (range_start,range_end,range_val) values (31 ,34 ,'B');
insert into ranges (range_start,range_end,range_val) values (39 ,63 ,'C');
insert into ranges (range_start,range_end,range_val) values (1 ,3 ,'D');
insert into ranges (range_start,range_end,range_val) values (4 ,4 ,'E');
insert into ranges (range_start,range_end,range_val) values (7 ,14 ,'F');
insert into ranges (range_start,range_end,range_val) values (19 ,27 ,'G');
insert into ranges (range_start,range_end,range_val) values (43 ,50 ,'H');
insert into ranges (range_start,range_end,range_val) values (55 ,61 ,'I');
insert into ranges (range_start,range_end,range_val) values (1 ,2 ,'J');
insert into ranges (range_start,range_end,range_val) values (9 ,11 ,'K');
insert into ranges (range_start,range_end,range_val) values (12 ,14 ,'L');
insert into ranges (range_start,range_end,range_val) values (22 ,23 ,'M');
insert into ranges (range_start,range_end,range_val) values (25 ,26 ,'N');
insert into ranges (range_start,range_end,range_val) values (57 ,61 ,'O');
insert into ranges (range_start,range_end,range_val) values (13 ,13 ,'P');
insert into ranges (range_start,range_end,range_val) values (60 ,61 ,'Q');
Requested Results
(Nulls are presented here as empty spaces)
JJDEAAFFKKKLPLAAAAGGGMMGNNGA BBBB CCCCHHHHHHHHCCCCIIOOOQQCC
range_start range_end range_val
----------- --------- ---------
1 2 J
3 3 D
4 4 E
5 6 A
7 8 F
9 11 K
12 12 L
13 13 P
14 14 L
15 18 A
19 21 G
22 23 M
24 24 G
25 26 N
27 27 G
28 28 A
29 30
31 34 B
35 38
39 42 C
43 50 H
51 54 C
55 56 I
57 59 O
60 61 Q
62 63 C
Optional addition final row:
64
Oracle solution:
with l as ( select level lvl from dual connect by level < 66 ),
r as ( select range_start r1, range_end r2, range_val v,
range_end - range_start + 1 cnt
from ranges ),
t1 as (select distinct lvl,
nvl(max(v) keep (dense_rank first order by cnt)
over (partition by lvl), '*' ) m
from l left join r on lvl between r1 and r2 ),
t2 as (select lvl, m, case when lag(m) over (order by lvl) <> m then 0 else 1 end mrk
from t1),
t3 as (select lvl, m, lvl - sum(mrk) over (order by lvl) grp from t2)
select min(lvl) r1, max(lvl) r2, nullif(min(m), '*') val
from t3 group by grp order by r1
Output is as requested. My English is far from good, so it's hard to explain, but let's try:
l
- number generator,
r
- data from ranges
with counted distance,
t1
- finds value with minimal distance for each lvl,
t2
- adds markers telling if range starts,
t3
- adds column which we will next use for
grouping data.
- The solution is based on the stack trace challenge I've presented previously. Each range start is treated as a PUSH operation and each range end + 1 is treated as a POP operation.
- Performence wise, you may notice how the 2 internal analytic functions use the same windowing, therefore being executed in a single step.
Teradata
select new_range_start
,new_range_end
,last_value (range_val ignore nulls) over
(
partition by stack_depth
order by new_range_start ,range_start ,range_end desc
rows unbounded preceding
) as new_range_val
from (select new_range_start
,range_val
,range_start
,range_end
,sum (case when range_val is null then -1 else 1 end) over
(
order by new_range_start, range_start ,range_end desc
rows unbounded preceding
) as stack_depth
,min (new_range_start) over
(
order by new_range_start ,range_start ,range_end desc
rows between 1 following and 1 following
) - 1 as new_range_end
from ( select range_start ,range_start ,range_end ,range_val from ranges
union all select range_end + 1 ,range_start ,range_end ,cast (null as char(1)) from ranges
)
r (new_range_start,range_start,range_end,range_val)
)
r
qualify new_range_end >= new_range_start
order by new_range_start
;
Oracle
select new_range_start
,new_range_end
,new_range_val
from (select new_range_start
,new_range_end
,last_value (range_val ignore nulls) over
(
partition by stack_depth
order by new_range_start ,range_start ,range_end desc
rows unbounded preceding
) as new_range_val
from (select new_range_start
,range_start
,range_end
,range_val
,sum (case when range_val is null then -1 else 1 end) over
(
order by new_range_start, range_start ,range_end desc
rows unbounded preceding
) as stack_depth
,lead (new_range_start) over
(
order by new_range_start, range_start ,range_end desc
) - 1 as new_range_end
from ( select range_start as new_range_start ,range_start ,range_end ,range_val from ranges
union all select range_end + 1 ,range_start ,range_end ,cast (null as char(1)) from ranges
)
r
)
r
)
r
where new_range_end >= new_range_start
order by new_range_start
;
SQL Server / PostgreSQL / Hive
select *
from (select new_range_start
,new_range_end
,min (range_val) over
(
partition by stack_depth,new_range_val_group_id
) as new_range_val
from (select new_range_start
,new_range_end
,range_val
,stack_depth
,count (range_val) over
(
partition by stack_depth
order by new_range_start ,range_start ,range_end desc
rows unbounded preceding
) as new_range_val_group_id
from (select new_range_start
,range_start
,range_end
,range_val
,sum (case when range_val is null then -1 else 1 end) over
(
order by new_range_start, range_start ,range_end desc
rows unbounded preceding
) as stack_depth
,lead (new_range_start) over
(
order by new_range_start, range_start ,range_end desc
) - 1 as new_range_end
from ( select range_start as new_range_start ,range_start ,range_end ,range_val from ranges
union all select range_end + 1 as new_range_start ,range_start ,range_end ,cast (null as char(1)) as range_val from ranges
)
r
)
r
)
r
)
r
where new_range_end >= new_range_start
order by new_range_start
;
Oracle solution 2
WITH borders AS /*get all borders of interval*/
(SELECT DISTINCT DECODE(is_end, 0, range_start, range_end) AS border
,is_end
FROM ranges r,
(SELECT 0 AS is_end FROM dual UNION ALL
SELECT 1 AS is_end FROM dual)),
interv AS /*get all intervals*/
(SELECT border + is_end AS beg_int
,lead(border) over(ORDER BY border, is_end )
- lead(DECODE(is_end, 0, 1, 0)) over(ORDER BY border, is_end) AS end_int
FROM borders
ORDER BY 1)
SELECT i.beg_int
,i.end_int
,(SELECT MAX(r.range_val) keep (dense_rank FIRST ORDER BY r.range_end - r.range_start)
FROM ranges r
WHERE i.beg_int >= r.range_start AND i.end_int <= r.range_end) AS range_val
FROM interv i
WHERE beg_int <= end_int OR end_int IS NULL
ORDER BY i.beg_int;
Add solution without self join :
EDIT: fixed defect.
WITH intervals AS
(SELECT DECODE(is_end, -1, range_val, NULL) AS range_val
,DECODE(is_end, -1, range_start, range_end) AS border
,is_end
,- (SUM(is_end) over(ORDER BY DECODE(is_end, -1, range_start, range_end), is_end, (range_end - range_start) * is_end)) AS poss
,(range_end - range_start) * is_end AS ord2
FROM ranges r
,(SELECT -1 AS is_end FROM dual UNION ALL
SELECT 1 AS is_end FROM dual)),
range_stack AS
(SELECT border + DECODE(is_end, 1, 1, 0) AS begin_int
,lead(border) over(ORDER BY border, is_end, ord2)
+ DECODE(lead(is_end) over(ORDER BY border, is_end, ord2), 1, 0, -1) AS end_int
,last_value(range_val ignore NULLS) over(PARTITION BY poss ORDER BY border, is_end, ord2) AS range_val
FROM intervals)
SELECT begin_int
,end_int
,range_val
FROM range_stack
WHERE end_int >= begin_int
OR end_int IS NULL;