#UKOUG_Tech16: Ranges, Ranges Everywhere

I presented this topic at Tech 16 on Monday afternoon. The latest version is on the Tech 16 website and on http://www.slideshare.net/StewAshton1/ranges-ranges-everywhere-oracle-sql

Thanks to the organisers, staff, presenters and delegates who made this a great conference and a wonderful experience!

#DOAG2016: Ranges, Ranges Everywhere!

Today I did my second presentation at DOAG2016. It was at 9:00 so I got to sleep in ;)

The room was huge but there were enough people that I didn’t feel too lonely.

The room and the technical help were top notch, and again there were questions at just the right time to remind me of things I might have left out!

[Update 2016-11-20: there was a bug on slide 21! Instead of “order by end_n” it should be “order by start_n”. I have updated the slide on the DOAG website and on Slideshare. My test cases were not thorough; my apologies…]

As promised, I put the presentation on Slideshare. Here is the link:

I will be blogging about some of the content later on, so stay tuned…

As always, please download the slides and play them in PowerPoint so you can see the animations.

Thanks to DOAG for the invite!

Joining Temporal Tables 3: Gaps

This is my third post about temporal tables that don’t allow overlaps. I’m going to add support for temporal gaps, which will also allow objects to stop existing at some point.

Test Data

I’m going to add a NOT NULL numeric field called START_GAP, which can be 0 or 1. When START_GAP = 0, the row contains data that is valid starting at FROM_DATE. When START_GAP = 1, then FROM_DATE is actually the end date of the previous row.

The data can now contain all the relations between time intervals that I listed previously.

alter session set nls_date_format='dd';
drop table a purge;

create table a (
  case_id varchar2(32),
  from_date date,
  start_gap number(1,0) not null check(start_gap in (0,1)),
  a_string varchar2(32),
  primary key(case_id, from_date)
);
Insert into A values ('01-precedes',to_date('2016-01-01','yyyy-mm-dd'),0,'A01-precedes');
Insert into A values ('01-precedes',to_date('2016-01-02','yyyy-mm-dd'),1,'A01-precedes');
Insert into A values ('02-meets',to_date('2016-01-01','yyyy-mm-dd'),0,'A02-meets');
Insert into A values ('02-meets',to_date('2016-01-02','yyyy-mm-dd'),1,'A02-meets');
Insert into A values ('03-overlaps',to_date('2016-01-01','yyyy-mm-dd'),0,'A03-overlaps');
Insert into A values ('03-overlaps',to_date('2016-01-03','yyyy-mm-dd'),1,'A03-overlaps');
Insert into A values ('04-finished by',to_date('2016-01-01','yyyy-mm-dd'),0,'A04-finished by');
Insert into A values ('04-finished by',to_date('2016-01-03','yyyy-mm-dd'),1,'A04-finished by');
Insert into A values ('05-contains',to_date('2016-01-01','yyyy-mm-dd'),0,'A05-contains');
Insert into A values ('05-contains',to_date('2016-01-04','yyyy-mm-dd'),1,'A05-contains');
Insert into A values ('06-starts',to_date('2016-01-01','yyyy-mm-dd'),0,'A06-starts');
Insert into A values ('06-starts',to_date('2016-01-02','yyyy-mm-dd'),1,'A06-starts');
Insert into A values ('07-equals',to_date('2016-01-01','yyyy-mm-dd'),0,'A07-equals');
Insert into A values ('07-equals',to_date('2016-01-02','yyyy-mm-dd'),1,'A07-equals');

drop table b purge;

create table b (
  case_id varchar2(32),
  from_date date,
  start_gap number(1,0) not null check(start_gap in (0,1)),
  b_string varchar2(32),
  primary key(case_id, from_date)
);

Insert into B values ('01-precedes',to_date('2016-01-03','yyyy-mm-dd'),0,'B01-precedes');
Insert into B values ('01-precedes',to_date('2016-01-04','yyyy-mm-dd'),1,'B01-precedes');
Insert into B values ('02-meets',to_date('2016-01-02','yyyy-mm-dd'),0,'B02-meets');
Insert into B values ('02-meets',to_date('2016-01-03','yyyy-mm-dd'),1,'B02-meets');
Insert into B values ('03-overlaps',to_date('2016-01-02','yyyy-mm-dd'),0,'B03-overlaps');
Insert into B values ('03-overlaps',to_date('2016-01-04','yyyy-mm-dd'),1,'B03-overlaps');
Insert into B values ('04-finished by',to_date('2016-01-02','yyyy-mm-dd'),0,'B04-finished by');
Insert into B values ('04-finished by',to_date('2016-01-03','yyyy-mm-dd'),1,'B04-finished by');
Insert into B values ('05-contains',to_date('2016-01-02','yyyy-mm-dd'),0,'B05-contains');
Insert into B values ('05-contains',to_date('2016-01-03','yyyy-mm-dd'),1,'B05-contains');
Insert into B values ('06-starts',to_date('2016-01-01','yyyy-mm-dd'),0,'B06-starts');
Insert into B values ('06-starts',to_date('2016-01-03','yyyy-mm-dd'),1,'B06-starts');
Insert into B values ('07-equals',to_date('2016-01-01','yyyy-mm-dd'),0,'B07-equals');
Insert into B values ('07-equals',to_date('2016-01-02','yyyy-mm-dd'),1,'B07-equals');
commit;

select * from a
union all
select * from b
order by 1,2,4;
CASE_ID FROM_DATE START_GAP A_STRING or B_STRING
01-precedes 01 0 A01-precedes
01-precedes 02 1 A01-precedes
01-precedes 03 0 B01-precedes
01-precedes 04 1 B01-precedes
02-meets 01 0 A02-meets
02-meets 02 1 A02-meets
02-meets 02 0 B02-meets
02-meets 03 1 B02-meets
03-overlaps 01 0 A03-overlaps
03-overlaps 02 0 B03-overlaps
03-overlaps 03 1 A03-overlaps
03-overlaps 04 1 B03-overlaps
04-finished by 01 0 A04-finished by
04-finished by 02 0 B04-finished by
04-finished by 03 1 A04-finished by
04-finished by 03 1 B04-finished by
05-contains 01 0 A05-contains
05-contains 02 0 B05-contains
05-contains 03 1 B05-contains
05-contains 04 1 A05-contains
06-starts 01 0 A06-starts
06-starts 01 0 B06-starts
06-starts 02 1 A06-starts
06-starts 03 1 B06-starts
07-equals 01 0 A07-equals
07-equals 01 0 B07-equals
07-equals 02 1 A07-equals
07-equals 02 1 B07-equals


What’s the idea?

As before, we need to carry down values and eliminate certain rows. How do we manage that for the new rows where START_GAP = 1?

  1. For table A, if the most recent START_GAP is 1 then A_STRING should be NULL. Same for B.
  2. If the most recent START_GAP is 1 for both tables, then that row should be eliminated.
  3. If the first row is from A, then B.START_GAP will be NULL. In that case B.START_GAP should be treated as 1, since the B object did not exist at that point.
  4. As before, TO_DATE must be NULL or greater than FROM_DATE, otherwise the row is eliminated.

The MATCH_RECOGNIZE solution

Note that I don’t create a ROWTYPE column anymore: since START_GAP is defined as NOT NULL, I can use A.START_GAP to determine whether the current row comes from table A.

select case_id, from_date, to_date, a_str a_string, b_str b_string
from (
  select case_id, from_date, start_gap agap, a_string,
    null bgap, null b_string
  from a
  union all
  select case_id, from_date, null, null, start_gap bgap, b_string
  from b
)
match_recognize(
  partition by case_id order by from_date
  measures next(from_date) to_date,
    nvl(a.agap,1) + nvl(b.bgap,1) abgap,
    case a.agap when 0 then a.a_string end a_str,
    case b.bgap when 0 then b.b_string end b_str
  all rows per match
  pattern( (a|b)+ )
  define a as agap is not null
)
where (to_date > from_date or to_date is null)
  and abgap < 2;
CASE_ID FROM_DATE TO_DATE A_STRING B_STRING
01-precedes 01 02 A01-precedes
01-precedes 03 04 B01-precedes
02-meets 01 02 A02-meets
02-meets 02 03 B02-meets
03-overlaps 01 02 A03-overlaps
03-overlaps 02 03 A03-overlaps B03-overlaps
03-overlaps 03 04 B03-overlaps
04-finished by 01 02 A04-finished by
04-finished by 02 03 A04-finished by B04-finished by
05-contains 01 02 A05-contains
05-contains 02 03 A05-contains B05-contains
05-contains 03 04 A05-contains
06-starts 01 02 A06-starts B06-starts
06-starts 02 03 B06-starts
07-equals 01 02 A07-equals B07-equals


The Analytic solution

This one took me some time to figure out. I broke it down into 3 logical steps using WITH clauses.

  • ALL_ROWS just does the UNION ALL, while transforming NULL values to CHR(0)
  • ANALYZED_ROWS applies all the analytic functions
  • The main SELECT transforms CHR(0) to NULL, makes a value NULL if the most recent START_GAP = 1, and eliminates excess rows.
with all_rows as (
  select case_id, from_date, start_gap agap, nvl(a_string, chr(0)) a_string,
    null bgap, null b_string
  from a
  union all
  select case_id, from_date, null, null,
    start_gap bgap, nvl(b_string, chr(0)) b_string
  from b
)
, analyzed_rows as (
  select case_id, from_date,
    lead(from_date) over(partition by case_id order by from_date) to_date,
    last_value(agap) ignore nulls
      over(partition by case_id order by from_date) agap,
    last_value(bgap) ignore nulls
      over(partition by case_id order by from_date) bgap,
    last_value(a_string) ignore nulls
      over(partition by case_id order by from_date) a_string,
    last_value(b_string) ignore nulls
      over(partition by case_id order by from_date) b_string
  from all_rows
)
select case_id, from_date, to_date,
  case when agap = 0 then nullif(a_string, chr(0)) end a_string,
  case when bgap = 0 then nullif(b_string, chr(0)) end b_string
from analyzed_rows
where (to_date > from_date or to_date is null)
  and nvl(agap,1) + nvl(bgap,1) < 2;

There’s more?

Yes indeed, I’m not through yet.

At this point I can define temporal tables that allow gaps but not overlaps. The attributes can be NULL but FROM_DATE can’t. The TO_DATE is not stored in the same row, but it can be displayed that way. I can also “join” two such tables in an efficient manner.

I only have one problem: your tables probably aren’t built that way. You have FROM_DATE and TO_DATE in each row, and if you say you don’t have overlapping ranges then you want me to believe you.

In my next post I’ll try to “join” tables that store FROM_DATE and TO_DATE, assuming there are no overlaps.

MATCH_RECOGNIZE: gaps in date ranges

To find gaps between date ranges, there is an elegant solution using analytic functions. We can follow the same logic with MATCH_RECOGNIZE, using a technique that avoids the problems with MIN() and MAX() comparisons I just wrote about.

Here are some simple test data

create table t ( start_date date, end_date date );
Insert into T values (DATE '2007-01-01', DATE '2007-01-15');
Insert into T values (DATE '2007-01-03', DATE '2007-01-10');
Insert into T values (DATE '2007-01-12', DATE '2007-01-25');
Insert into T values (DATE '2007-01-20', DATE '2007-02-01');
Insert into T values (DATE '2007-02-05', DATE '2007-02-10');
Insert into T values (DATE '2007-02-05', DATE '2007-02-28');
Insert into T values (DATE '2007-02-10', DATE '2007-02-15');
Insert into T values (DATE '2007-03-01', DATE '2007-03-02');
Insert into T values (DATE '2007-03-03', DATE '2007-03-16');

Please note that I use exclusive end dates: the second to last range ends when March 2nd starts, so there should be a gap from March 2nd to March 3rd.

Using analytics, I compare the maximum end date so far to the next start date:

SELECT * FROM (
  SELECT MAX(end_date) OVER (ORDER BY start_date) start_gap,
  LEAD(start_date) OVER (ORDER BY start_date) end_gap
  FROM T
)
WHERE start_gap < end_gap;
START_GAP END_GAP
2007-02-01 2007-02-05
2007-02-28 2007-03-01
2007-03-02 2007-03-03

That is so simple – once you think of it.

Looking for a MATCH_RECOGNIZE solution, I tried to compare the current start date to the maximum prior end date. I ran into the problem I wrote about previously: when you use MAX() or other aggregates in a comparison, the current row is included.

I finally realized that if I compared the maximum end date to the next start date, the problem disappeared.

select start_gap, end_gap from t
match_recognize(
  order by start_date
  measures max(end_date) start_gap, next(start_date) end_gap
  all rows per match
  pattern((A|{-B-})+)
  define A as max(end_date) < next(start_date)
);

For those less familiar with the syntax

  • Every row that has a gap with the following row gets assigned to “A”, and all the other rows get assigned to “B”.
  • Normally, all rows per match will output every row, but {-B-} excludes the “B” rows from the output.

This code is logically equivalent to the analytic solution above. It also demonstrates how to avoid the knotty problem of how to exclude the current row from aggregates:

  1. Instead of comparing the aggregate to the current row, just back up a row.
  2. Now the aggregate contains exactly the rows you want.
  3. So compare that aggregate to the value of the row you care about, which is the “next” row.

Merging Overlapping Date Ranges with MATCH_RECOGNIZE

I forgot to add a MATCH_RECOGNIZE solution to my last post on merging overlapping date ranges! That should take me just a few minutes, right? Wrong: it’s not that easy and here’s why.

UPDATE 2018-06-24: “mathguy”, a top contributor to OTN/ODC, pointed out that the “workaround” I originally published was not correct. My thanks to him and my apologies to my readers! Better solution below…

For test data, please refer to the previous post.

To Merge or not to Merge?

The idea is to merge date ranges if they meet or overlap either partially or completely – in other words, we keep merging as long as there is not a gap. How do we know whether or not a gap exists?

  • We sort by start date, then end date.
  • The current row should be merged if its start date is less than or equal to
    the most recent end date, up to but not including the current row.
  • Again, the secret is to compare the most recent end date, not just the end date on the previous line.

The Analytics solution

With analytics, the “windowing clause” lets us do exactly what we want: when we say rows between unbounded preceding and 1 preceding then Oracle will apply the max() function to all the preceding rows, up to but not including the current row. We simply compare the current start date with that maximum preceding end date and we know whether or not to merge.
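Here is the core of that comparison as a standalone sketch against the same test table T; the complete start-of-group query appears in the previous post, further down this page.

-- GRP_START = 1 when the current range starts after every preceding range has ended
select test_case, start_date, end_date,
  case
    when start_date > max(end_date) over(
      partition by test_case order by start_date, end_date
      rows between unbounded preceding and 1 preceding
    )
    then 1 else 0
  end grp_start
from t;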

The MATCH_RECOGNIZE difficulty

MATCH_RECOGNIZE, available starting in version 12c, is pretty powerful and does things analytics can’t do, but it doesn’t really have a windowing clause. When doing comparisons, this can be a problem.

Let’s try a simple test:

select * from t
match_recognize(
  partition by test_case
  order by start_date, end_date
  measures match_number() match, max(end_date) max_date
  all rows per match
  pattern(a+)
  define a as start_date <= max(a.end_date)
)
where rownum <= 2;

Here I'm comparing the start_date with the maximum date already assigned to A – no I’m not! That is not the way the DEFINE clause works. It assumes the current row belongs to A, then does the test based on that assumption. As a result, the end date from the current row is included in the computation. Look at the result:

TEST_CASE START_DATE END_DATE MATCH MAX_DATE
01:precedes 2015-06-01 2015-06-02 1 2015-06-02
01:precedes 2015-06-03 2015-06-04 1 2015-06-04

The “match number” is the same, which means the second line will be merged with the first – and all because the start date is being compared to the wrong end date.

The Preferred Solution

I can’t compare the current start date with the maximum preceding end date – but I can compare the current maximum end date with the next start date! As long as that comparison is true, both the current row and the next row should be part of the match.

select * from t
match_recognize(
  partition by test_case
  order by start_date, end_date
  measures first(start_date) start_date, max(end_date) end_date
  pattern(a* b)
  define a as max(end_date) >= next(start_date)
);

There are two special cases to consider:

  1. The a comparison fails at the first row: in that case, the match should contain the first row only since it meets the implicit b, which is always true. The pattern(a* b) will ensure that.
  2. There is only one row left, so next(start_date) will be null. The a comparison will fail but that row will match the b condition, which is what we want.

The Workaround – NOT

UPDATE 2018-06-24: this solution is not correct. Please see the comment by “mathguy” for an explanation.

MATCH_RECOGNIZE does give you access to other rows than the current row, using PREV, NEXT, FIRST and LAST. In this case, I could say LAST(end_date,1) and get the end date from the preceding matched line – or I could just say PREV(end_date) for the same result.

So how can I be sure that the preceding end date is the most recent? MATCH_RECOGNIZE doesn’t let you use analytic functions around LAST or PREV. I’ll just have to change the ORDER BY and sort by end date first. When I do, it seems to work (but see the update above):

select * from t
match_recognize(
  partition by test_case
  order by end_date, start_date
  measures min(start_date) start_date, max(end_date) end_date
  pattern(a b*)
  define b as start_date <= last(end_date,1)
);
TEST_CASE START_DATE END_DATE
01:precedes 2015-06-01 2015-06-02
01:precedes 2015-06-03 2015-06-04
02:meets 2015-06-01 2015-06-03
03:overlaps 2015-06-01 2015-06-04
04:finished by 2015-06-01 2015-06-03
05:contains 2015-06-01 2015-06-04
06:starts 2015-06-01 2015-06-03
07:equals 2015-06-01 2015-06-02
08:started by 2015-06-01 2015-06-03
09:during 2015-06-01 2015-06-04
10:finishes 2015-06-01 2015-06-03
11:overlapped by 2015-06-01 2015-06-04
12:met by 2015-06-01 2015-06-03
13:preceded by 2015-06-01 2015-06-02
13:preceded by 2015-06-03 2015-06-04


Merging Overlapping Date Ranges

A recent question on the OTN forum asked about merging date ranges. Thanks to Karthick_App I realized that my previous blog on this subject was incomplete. Here is a solution that will merge any date ranges that meet, or that “overlap” in any way.

Define your terms

As a reminder, Allen’s Interval Algebra defines 13 possible relations between date ranges.

Relation            First range   Second range   (start day - end day of month)
 1: precedes        1-2           3-4
 2: meets           1-2           2-3
 3: overlaps        1-3           2-4
 4: finished by     1-3           2-3
 5: contains        1-4           2-3
 6: starts          1-2           1-3
 7: equals          1-2           1-2
 8: started by      1-3           1-2
 9: during          2-3           1-4
10: finishes        2-3           1-3
11: overlapped by   2-4           1-3
12: met by          2-3           1-2
13: preceded by     3-4           1-2


As you can see, Allen uses “overlap” only when there is a partial overlap. The OTN questioner really wanted to merge any date ranges that did not have a gap, in other words every case except “precedes” and “preceded by”.

To test this, I’ll create a little table with 13 test cases:

drop table t purge;
create table t (
  test_case varchar2(32) not null,
  start_date date not null,
  end_date date not null
);
Insert into t values ('01:precedes',to_date('01','DD'),to_date('02','DD'));
Insert into t values ('01:precedes',to_date('03','DD'),to_date('04','DD'));
Insert into t values ('02:meets',to_date('01','DD'),to_date('02','DD'));
Insert into t values ('02:meets',to_date('02','DD'),to_date('03','DD'));
Insert into t values ('03:overlaps',to_date('01','DD'),to_date('03','DD'));
Insert into t values ('03:overlaps',to_date('02','DD'),to_date('04','DD'));
Insert into t values ('04:finished by',to_date('01','DD'),to_date('03','DD'));
Insert into t values ('04:finished by',to_date('02','DD'),to_date('03','DD'));
Insert into t values ('05:contains',to_date('01','DD'),to_date('04','DD'));
Insert into t values ('05:contains',to_date('02','DD'),to_date('03','DD'));
Insert into t values ('06:starts',to_date('01','DD'),to_date('02','DD'));
Insert into t values ('06:starts',to_date('01','DD'),to_date('03','DD'));
Insert into t values ('07:equals',to_date('01','DD'),to_date('02','DD'));
Insert into t values ('07:equals',to_date('01','DD'),to_date('02','DD'));
Insert into t values ('08:started by',to_date('01','DD'),to_date('03','DD'));
Insert into t values ('08:started by',to_date('01','DD'),to_date('02','DD'));
Insert into t values ('09:during',to_date('02','DD'),to_date('03','DD'));
Insert into t values ('09:during',to_date('01','DD'),to_date('04','DD'));
Insert into t values ('10:finishes',to_date('02','DD'),to_date('03','DD'));
Insert into t values ('10:finishes',to_date('01','DD'),to_date('03','DD'));
Insert into t values ('11:overlapped by',to_date('02','DD'),to_date('04','DD'));
Insert into t values ('11:overlapped by',to_date('01','DD'),to_date('03','DD'));
Insert into t values ('12:met by',to_date('02','DD'),to_date('03','DD'));
Insert into t values ('12:met by',to_date('01','DD'),to_date('02','DD'));
Insert into t values ('13:preceded by',to_date('03','DD'),to_date('04','DD'));
Insert into t values ('13:preceded by',to_date('01','DD'),to_date('02','DD'));
commit;
TEST_CASE START_DATE END_DATE
01:precedes 2015-06-01 2015-06-02
01:precedes 2015-06-03 2015-06-04
02:meets 2015-06-01 2015-06-02
02:meets 2015-06-02 2015-06-03
03:overlaps 2015-06-01 2015-06-03
03:overlaps 2015-06-02 2015-06-04
04:finished by 2015-06-01 2015-06-03
04:finished by 2015-06-02 2015-06-03
05:contains 2015-06-01 2015-06-04
05:contains 2015-06-02 2015-06-03
06:starts 2015-06-01 2015-06-02
06:starts 2015-06-01 2015-06-03
07:equals 2015-06-01 2015-06-02
07:equals 2015-06-01 2015-06-02
08:started by 2015-06-01 2015-06-03
08:started by 2015-06-01 2015-06-02
09:during 2015-06-02 2015-06-03
09:during 2015-06-01 2015-06-04
10:finishes 2015-06-02 2015-06-03
10:finishes 2015-06-01 2015-06-03
11:overlapped by 2015-06-02 2015-06-04
11:overlapped by 2015-06-01 2015-06-03
12:met by 2015-06-02 2015-06-03
12:met by 2015-06-01 2015-06-02
13:preceded by 2015-06-03 2015-06-04
13:preceded by 2015-06-01 2015-06-02

The solution

I wrote about the “start of group” method some time ago.

  • The first step is to assign 1 to each row that starts a group and 0 to every other row.
    This is the tricky part: I partition by test_case and order by start_date and end_date, and start a group only when the current start_date is greater than every preceding end_date.
  • The next step is to assign a group to each row using a running sum of the 1s and 0s.
  • Finally, I do the group by.
with grp_starts as (
  select test_case, start_date, end_date,
  case
    when start_date > max(end_date) over(
      partition by test_case order by start_date, end_date
      rows between unbounded preceding and 1 preceding
    )
    then 1 else 0
  end grp_start
  from t
)
, grps as (
  select test_case, start_date, end_date,
  sum(grp_start) over(
    partition by test_case order by start_date, end_date
  ) grp
  from grp_starts
)
select test_case,
min(start_date) start_date,
max(end_date) end_date
from grps
group by test_case, grp
order by 1, 2;
TEST_CASE START_DATE END_DATE
01:precedes 2015-06-01 2015-06-02
01:precedes 2015-06-03 2015-06-04
02:meets 2015-06-01 2015-06-03
03:overlaps 2015-06-01 2015-06-04
04:finished by 2015-06-01 2015-06-03
05:contains 2015-06-01 2015-06-04
06:starts 2015-06-01 2015-06-03
07:equals 2015-06-01 2015-06-02
08:started by 2015-06-01 2015-06-03
09:during 2015-06-01 2015-06-04
10:finishes 2015-06-01 2015-06-03
11:overlapped by 2015-06-01 2015-06-04
12:met by 2015-06-01 2015-06-03
13:preceded by 2015-06-01 2015-06-02
13:preceded by 2015-06-03 2015-06-04

Conclusion

Once again, the “start of group” method demonstrates its power, combined with a “window clause” in the analytic function that lets us identify the starting row of our group. Thanks again to Karthick_App for noticing that my first solution was incomplete.

And don’t forget: this doesn’t work if the end date can be NULL. If you allow that, substitute DATE ‘9999-12-31’ for the NULL at the beginning of the query, then translate it back to NULL at the end.
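For example, a minimal sketch of that substitution, assuming END_DATE is nullable:

with t_no_nulls as (
  -- replace NULL end dates with the end-of-time value so the comparisons work
  select test_case, start_date,
    nvl(end_date, date '9999-12-31') end_date
  from t
)
-- ... run the merge query above against T_NO_NULLS,
-- then translate the extreme date back to NULL on output:
select test_case, start_date,
  nullif(end_date, date '9999-12-31') end_date
from t_no_nulls;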

Join Tables on Date Ranges

A recent question on the OTN SQL forum asked how best to join two tables related by ID and date range, in order to insert one row per date into a data warehouse. One solution was to expand the data from each table, creating one row per date, then join on date. I think it’s more efficient to join on date range, then expand. Here’s how.

Input Tables and Requirement

I am going to simplify a bit from the OTN setup.

The first table contains an ID, a start date and an end date. (I recommend always using exclusive end dates.) The final output should have one row for every date within the date range.

create table table_1 (
  id number primary key,
  start_date date not null check (start_date = trunc(start_date)),
  end_date date not null check (end_date = trunc(end_date)),
  check (end_date > start_date)
);
TABLE_1                             Output
ID  START_DATE  END_DATE            EACH_DATE
1   2015-03-01  2015-03-05          2015-03-01
                                    2015-03-02
                                    2015-03-03
                                    2015-03-04


The second table contains the ID and a “review date” that assigns a “group ID”. In the output, each row should contain the data from the most recent review – or null if there was no prior review.

create table table_2 (
  id number,
  review_date date not null check (review_date = trunc(review_date)),
  group_id varchar2(20) not null,
  primary key (id, review_date)
);
TABLE_1                          Output                              TABLE_2
ID  START_DATE  END_DATE         EACH_DATE   REVIEW_DATE  GROUP_ID   ID  REVIEW_DATE  GROUP_ID
1   2015-03-01  2015-03-05       2015-03-01  (null)       (null)
                                 2015-03-02  2015-03-02   First      1   2015-03-02   First
                                 2015-03-03  2015-03-02   First
                                 2015-03-04  2015-03-04   Second     1   2015-03-04   Second


Note: the most recent review is used even if that review took place before the start date in table 1.

The algorithm: join, then expand

If you look at the pretty table just above, you might think that I need to generate the “Output” rows first, then join each table to them. That means joining three sources. When I tell you that each ID can have a date range of several months, you can see that the extra work and memory required are not negligible.

So let’s try to join the two tables directly, based on intersecting date ranges:

  1. TABLE_1 already has a start date and an end date.
  2. TABLE_2 has a review date, which starts a range.
    1. I use the LEAD() analytic function to get the next review date, which is the end of the range.
    2. To identify the first row for each ID, I use the LAG() analytic function to get the previous review date (which will be NULL on the first row). We’ll see why; there is also a standalone sketch of this inline view just after this list.
  3. I join the tables:
    when the review date is earlier than the end date,
    and the next review date is either null or later than the start date.
    It’s a LEFT JOIN since I produce rows whether the ID is in TABLE_2 or not.
  4. Now I have to create a new date range that represents just the intersection.
    1. The beginning is the later of start date and review date – except if the previous review date is null.
      In that case this is the first row for an ID, so the beginning is the start date.
    2. The end is the earlier of end date and next review date.
  5. Finally I expand the rows
    1. When expanding the first row, I may find a start date that is earlier than the review date (see step 4.1 above).
      For days earlier than the review date, I output a row with a null review date and a null group id.
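Here is the sketch for point 2; this is exactly the inline view that appears as “b” in the full solution below.

-- each review date starts a range that ends at the next review date;
-- a NULL prev_rd flags the first review for the ID
select id, review_date, group_id,
  lead(review_date) over (partition by id order by review_date) end_rd,
  lag(review_date) over (partition by id order by review_date) prev_rd
from table_2;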

Test data

Allen’s Interval Algebra defines 13 possible relations between date ranges.

Relation            First range   Second range   (start day - end day of month)
 1: precedes        1-2           3-4
 2: meets           1-2           2-3
 3: overlaps        1-3           2-4
 4: finished by     1-3           2-3
 5: contains        1-4           2-3
 6: starts          1-2           1-3
 7: equals          1-2           1-2
 8: started by      1-3           1-2
 9: during          2-3           1-4
10: finishes        2-3           1-3
11: overlapped by   2-4           1-3
12: met by          2-3           1-2
13: preceded by     3-4           1-2

In my test data, ID 0 has no review date. IDs 1 through 13 have one review date, and IDs 14 through 26 have two. The review dates correspond to the start and end dates of the second range for each relation in the table above.

alter session set nls_date_format='DD';

insert into table_1 select
0, to_date(1),to_date(2) from dual union all select
1, to_date(1),to_date(2) from dual union all select
2, to_date(1),to_date(2) from dual union all select
3, to_date(1),to_date(3) from dual union all select
4, to_date(1),to_date(3) from dual union all select
5, to_date(1),to_date(4) from dual union all select
6, to_date(1),to_date(2) from dual union all select
7, to_date(1),to_date(2) from dual union all select
8, to_date(1),to_date(3) from dual union all select
9, to_date(2),to_date(3) from dual union all select
10, to_date(2),to_date(3) from dual union all select
11, to_date(2),to_date(4) from dual union all select
12, to_date(2),to_date(3) from dual union all select
13, to_date(3),to_date(4) from dual;

insert into table_1
select id+13, start_date, end_date
from table_1
where id > 0;

insert into table_2 select
1, to_date(3),'precedes' from dual union all select
2, to_date(2),'meets' from dual union all select
3, to_date(2),'overlaps' from dual union all select
4, to_date(2),'finished by' from dual union all select
5, to_date(2),'contains' from dual union all select
6, to_date(1),'starts' from dual union all select
7, to_date(1),'equals' from dual union all select
8, to_date(1),'started by' from dual union all select
9, to_date(1),'during' from dual union all select
10, to_date(1),'finishes' from dual union all select
11, to_date(1),'overlapped by' from dual union all select
12, to_date(1),'met by' from dual union all select
13, to_date(1),'preceded by' from dual;

insert into table_2
select id+13, review_date, group_id from table_2;

insert into table_2
select id,
review_date +
case when group_id = 'during' then 3
  when group_id in ('overlaps','starts','finishes','overlapped by') then 2
  else 1
end,
'after ' || group_id
from table_2
where id > 13;
commit;

alter session set nls_date_format='yyyy-mm-dd';

The solution (finally!)

select id, column_value each_date, start_date, end_date, 
  case when column_value >= review_date then review_date end review_date, 
  case when column_value >= review_date then group_id end group_id
from (
  select a.id, start_date, end_date, review_date, group_id,
  case when prev_rd is null
    then start_date
    else greatest(start_date, review_date)
  end range_start,
  case when end_rd is null
    then end_date
    else least(end_date, end_rd)
  end range_end
  from table_1 a
  left join (
    select review_date, group_id, id,
    lead(review_date) over (partition by id order by review_date) end_rd,
    lag(review_date) over (partition by id order by review_date) prev_rd
    from table_2
  ) b
  on a.id = b.id
    and start_date < coalesce(end_rd, end_date)
    and end_date > review_date
) 
,
table(cast(multiset(
  select range_start-1+level from dual
  connect by range_start-1+level < range_end
) as sys.odcidatelist))
order by id, each_date;
  • The inline view b: get previous and next review dates from table_2 with LEAD() and LAG().
  • The join conditions: left join on ID, with end date greater than review date and start date less than next review date. If the next review date is null, we are on the last review date so we go all the way to the end date.
  • The RANGE_START and RANGE_END expressions: calculate the date range for the joined row.
  • The CASE expressions in the outermost SELECT: leave the review date and group id columns NULL if EACH_DATE is earlier than the first review date.
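The row generator at the end deserves a note: TABLE(CAST(MULTISET(…))) turns the correlated CONNECT BY subquery into a collection of dates, one row per day. In isolation, with hard-coded dates substituted for the correlated RANGE_START and RANGE_END, it looks like this:

-- one row per day from 2015-03-01 up to, but not including, 2015-03-05
select column_value each_date
from table(cast(multiset(
  select date '2015-03-01' - 1 + level from dual
  connect by date '2015-03-01' - 1 + level < date '2015-03-05'
) as sys.odcidatelist));

And here is the full output: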
ID EACH_DATE START_DATE END_DATE REVIEW_DATE GROUP_ID
0 2015-03-01 2015-03-01 2015-03-02
1 2015-03-01 2015-03-01 2015-03-02
2 2015-03-01 2015-03-01 2015-03-02
3 2015-03-01 2015-03-01 2015-03-03
3 2015-03-02 2015-03-01 2015-03-03 2015-03-02 overlaps
4 2015-03-01 2015-03-01 2015-03-03
4 2015-03-02 2015-03-01 2015-03-03 2015-03-02 finished by
5 2015-03-01 2015-03-01 2015-03-04
5 2015-03-02 2015-03-01 2015-03-04 2015-03-02 contains
5 2015-03-03 2015-03-01 2015-03-04 2015-03-02 contains
6 2015-03-01 2015-03-01 2015-03-02 2015-03-01 starts
7 2015-03-01 2015-03-01 2015-03-02 2015-03-01 equals
8 2015-03-01 2015-03-01 2015-03-03 2015-03-01 started by
8 2015-03-02 2015-03-01 2015-03-03 2015-03-01 started by
9 2015-03-02 2015-03-02 2015-03-03 2015-03-01 during
10 2015-03-02 2015-03-02 2015-03-03 2015-03-01 finishes
11 2015-03-02 2015-03-02 2015-03-04 2015-03-01 overlapped by
11 2015-03-03 2015-03-02 2015-03-04 2015-03-01 overlapped by
12 2015-03-02 2015-03-02 2015-03-03 2015-03-01 met by
13 2015-03-03 2015-03-03 2015-03-04 2015-03-01 preceded by
14 2015-03-01 2015-03-01 2015-03-02
15 2015-03-01 2015-03-01 2015-03-02
16 2015-03-01 2015-03-01 2015-03-03
16 2015-03-02 2015-03-01 2015-03-03 2015-03-02 overlaps
17 2015-03-01 2015-03-01 2015-03-03
17 2015-03-02 2015-03-01 2015-03-03 2015-03-02 finished by
18 2015-03-01 2015-03-01 2015-03-04
18 2015-03-02 2015-03-01 2015-03-04 2015-03-02 contains
18 2015-03-03 2015-03-01 2015-03-04 2015-03-03 after contains
19 2015-03-01 2015-03-01 2015-03-02 2015-03-01 starts
20 2015-03-01 2015-03-01 2015-03-02 2015-03-01 equals
21 2015-03-01 2015-03-01 2015-03-03 2015-03-01 started by
21 2015-03-02 2015-03-01 2015-03-03 2015-03-02 after started by
22 2015-03-02 2015-03-02 2015-03-03 2015-03-01 during
23 2015-03-02 2015-03-02 2015-03-03 2015-03-01 finishes
24 2015-03-02 2015-03-02 2015-03-04 2015-03-01 overlapped by
24 2015-03-03 2015-03-02 2015-03-04 2015-03-03 after overlapped by
25 2015-03-02 2015-03-02 2015-03-03 2015-03-02 after met by
26 2015-03-03 2015-03-03 2015-03-04 2015-03-02 after preceded by


Date Ranges in Data Warehouses using Oracle 12c

When you load data with an “effective date” into a Data Warehouse, what happens when the new data is the same as yesterday’s? Do you add a new record with the new “effective date” or do you update a date range in the existing record?

At Open World last month, I got both answers: Tim Gorman presented an efficient method for loading “effective dates” and Dr. Holger Friedrich presented a way to maintain date ranges.

I mentioned to Dr. Friedrich that the 12c MATCH_RECOGNIZE clause could maintain those date ranges more efficiently. By the time I actually wrote the SQL, Dr. Friedrich had written practically the same thing. Here is my variant, inspired by Dr. Friedrich’s test case.

The Test Data

I have three tables:

  • ALL_DATA is the Data Warehouse. Each row is valid starting from DATE_FROM (inclusive) to DATE_TO (exclusive). The “current” rows have a DATE_TO of 9999-12-31.
  • CURRENT_DATA contains all the rows in ALL_DATA that have DATE_TO = 9999-12-31.
  • STAGING_DATA contains the data to be loaded. DATE_EFFECTIVE is the “effective date”.

To help me see what happens, I increment a sequence every time I load data into ALL_DATA. Updated rows are assigned the sequence number + 0.1, and new rows the sequence number + 0.2.

drop sequence load_seq;
create sequence load_seq;

drop table all_data purge;
create table all_data as
select 1 pk_id,
date '2014-11-13' date_from,
date '9999-12-31' date_to,
'A' rec_value,
load_seq.nextval+.2 load_id
from dual;

drop table current_data purge;
create table current_data as
select * from all_data;

drop table staging_data purge;
create table staging_data(pk_id, date_effective, rec_value)
as select
1, date '2014-12-01', 'A' from dual union all select
1, date '2014-12-02', 'B' from dual union all select
1, date '2014-12-03', 'B' from dual union all select
1, date '2014-12-04', 'C' from dual union all select
1, date '2014-12-05', 'A' from dual union all select
1, date '2014-12-06', 'A' from dual union all select
1, date '2014-12-07', 'D' from dual;
  • The first staging row (2014-12-01, 'A') should be combined with the existing row in ALL_DATA.
  • The rows for 2014-12-02 and 2014-12-03 (both 'B') should be combined into one row.
  • The rows for 2014-12-05 and 2014-12-06 (both 'A') should be combined into one row.

Finding the Data to Change

Dr. Friedrich used the method I call “Start of Group” to find out what data to change. MATCH_RECOGNIZE replaces this method more efficiently and with less code.

select * from (
  select pk_id, date_effective, rec_value
  from staging_data
  union all
  select pk_id, date_from, rec_value
  from all_data
  where date_to >= (select min(date_effective) from staging_data)
) match_recognize (
  partition by pk_id order by date_effective
  measures first(date_effective) date_from,
    nvl(next(date_effective), date '9999-12-31') date_to,
    rec_value rec_value
  pattern(a b*)
  define b as rec_value = prev(rec_value)
);
  • The second branch of the UNION ALL makes sure to include all the rows in ALL_DATA that could be impacted by the new data.
  • The PATTERN and DEFINE clauses: in this simple example, REC_VALUE is the only data that can change, so I group together all the consecutive rows that have the same value.
PK_ID DATE_FROM DATE_TO REC_VALUE
1 2014-11-13 2014-12-02 A
1 2014-12-02 2014-12-04 B
1 2014-12-04 2014-12-05 C
1 2014-12-05 2014-12-07 A
1 2014-12-07 9999-12-31 D


Merging the Changes

var load_id number;
exec :load_id := load_seq.nextval;

merge into all_data o
using (
  select * from (
    select pk_id, date_effective, rec_value
    from staging_data
    union all
    select pk_id, date_from, rec_value
    from all_data
    where date_to >= (select min(date_effective) from staging_data)
  ) match_recognize (
    partition by pk_id order by date_effective
    measures first(date_effective) date_from,
      nvl(next(date_effective), date '9999-12-31') date_to,
      rec_value rec_value
    pattern(a b*)
    define b as rec_value = prev(rec_value)
  )
) n
on ( (o.pk_id, o.date_from) = ((n.pk_id, n.date_from)) )
when matched then update set
  load_id = to_number(:load_id)+.1, date_to = n.date_to
  where o.date_to != n.date_to
when not matched then insert values(
  n.pk_id, n.date_from, n.date_to, n.rec_value, :load_id+.2
);

5 rows merged.
  • The WHERE clause of the UPDATE: my USING clause may return rows from ALL_DATA that don’t need to be changed, so I check DATE_TO to make sure I don’t do an update for nothing.
  • As you can see from the LOAD_ID values in the output below, I updated the existing row (load 2.1) and inserted 4 new rows (load 2.2).
select * from all_data order by 3;
PK_ID DATE_FROM DATE_TO REC_VALUE LOAD_ID
1 2014-11-13 2014-12-02 A 2.1
1 2014-12-02 2014-12-04 B 2.2
1 2014-12-04 2014-12-05 C 2.2
1 2014-12-05 2014-12-07 A 2.2
1 2014-12-07 9999-12-31 D 2.2


Refreshing CURRENT_DATA

merge into current_data o
using (
  select * from all_data
  where date_to = date '9999-12-31'
) n
on (o.pk_id = n.pk_id)
when matched then update set
  load_id = n.load_id, date_from = n.date_from,
  rec_value = n.rec_value
  where o.load_id != n.load_id
when not matched then insert values(
  n.pk_id, n.date_from, n.date_to, n.rec_value, n.load_id
);

1 rows merged.

Avoiding Unnecessary Updates

If you run each MERGE statement a second time, you will see the lovely message “0 rows merged.” I always do this test. If you are changing data the second time, either there is a bug or you are updating a row to be the same as it was before, which is a lot of work for no benefit.
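To script that test, you can wrap the MERGE in a PL/SQL block and display SQL%ROWCOUNT — a sketch using the CURRENT_DATA refresh from above:

set serveroutput on
begin
  merge into current_data o
  using (
    select * from all_data
    where date_to = date '9999-12-31'
  ) n
  on (o.pk_id = n.pk_id)
  when matched then update set
    load_id = n.load_id, date_from = n.date_from,
    rec_value = n.rec_value
    where o.load_id != n.load_id
  when not matched then insert values(
    n.pk_id, n.date_from, n.date_to, n.rec_value, n.load_id
  );
  -- should print 0 on the second and subsequent runs
  dbms_output.put_line(sql%rowcount || ' rows merged.');
end;
/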

SQL and date ranges: don’t make NULL mean something

We often use NULL values in date ranges, usually in the end date to signify the indefinite future. I have decided that this is a bad idea; here are my reasons.

[Update 2014-07-04] There have been some great replies to this post. I am going to address some of their points.

1) NULL should not mean anything

According to the SQL standard, NULL means “unknown”. That is why you can’t really compare a NULL value to anything, you can only say if it IS NULL or not.

If you use NULL in date ranges, it doesn’t just mean something, it means two different things! In the start date it means “from the beginning of time”, and in the end date it means “until the end of time”. This is not only contrary to the “official” meaning of NULL, it is confusing.

[Update] Kevan Gelling points out that there may really be “unknown” values, in which case NULL should be allowed. I agree, as long as NULL is reserved for that use.

Jeffrey Kemp says: ‘In most cases an End Date of NULL means “we don’t know (yet) what the end date will be – or even if it will ever end”.’ Jeffrey, you should add “we don’t even know if it has already ended”! Your phrase taken alone implies that the end date is in the future, not in the past. This makes NULL meaningful.

2) We can use real date limits instead of NULL

One argument for using NULL is that it means the earliest possible start date, or the latest possible end date. We don’t need NULL for that! In Oracle, the earliest possible date is 4712-01-01 B.C., or DATE '-4712-01-01'. The latest possible date is DATE '9999-12-31' (you can add the time element 23:59:59 if you want to be a purist.)

To enforce this, I suggest declaring the start and end date columns as NOT NULL with default values. In Database 12c, I would use the DEFAULT ON NULL clause: this clause puts the default value in the column even if you explicitly try to put a NULL there.

If you want the output to show NULL instead of these default values, you do have to use something like NULLIF() in your SELECT clause.
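To illustrate with a hypothetical table (my example, not from any of the posts above):

create table price_history (
  product_id number not null,
  -- DEFAULT ON NULL implies NOT NULL: an explicit NULL is replaced by the default
  start_date date default on null date '-4712-01-01',
  end_date date default on null date '9999-12-31',
  price number,
  primary key (product_id, start_date)
);

-- the explicit NULL end date becomes DATE '9999-12-31' in the table
insert into price_history (product_id, start_date, end_date, price)
values (1, date '2015-01-01', null, 100);

-- show NULL instead of the artificial extreme date
select product_id, start_date,
  nullif(end_date, date '9999-12-31') end_date, price
from price_history;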

[Update] Again, Kevan Gelling argues that NULL may be necessary to indicate a value that is not known. In that case my suggestion is no good, but you still have to use something other than NULL to indicate “beginning of time” and “end of time”.

Several posters bring up the fact that artificial extreme dates will “skew” the data and make it hard for the optimizer to choose the best plan. They are right, I should mention this. However, the data is skewed whether I use extreme dates or NULL. Supposing I do use NULL to mean “the end of time”, many queries will have to use END_DATE IS NULL in the WHERE clause, so the skew is there no matter what.

3) Oracle doesn’t index NULLs

When you query data with date ranges, you often have to check “greater than” one value and “less than” another value. This may require separate indexes on start date and end date. If you have NULLs in your date ranges, those indexes will not be used since Oracle doesn’t create index entries when all the indexed columns are NULL.

If you use real values and NOT NULL, your indexes will always work.

[Update] Some readers apparently didn’t see the phrase that I have now underlined. The header was probably misleading: Oracle will indeed index a row if any of the indexed columns is not NULL.

Some objected that indexes on dates alone are rarely used: date ranges are almost always applied to some object, so the index will include the object and one or both dates. In that case, the rows with NULL dates will be indexed. I agree. In that situation, NULLs don’t cause a problem, and “skew” won’t either, as long as the object comes before the date in the index column list.

4) Queries on NULLable date ranges are hard

I have blogged about various date range problems: finding gaps, finding overlaps and merging contiguous ranges. Almost always, my solutions worked fine without NULLs and broke when I introduced NULLs. Handling NULLs required either more complex queries or substituting real values for NULL. Why go to all that work when we can just use the real values to begin with?

[Update] Even if we need NULL values sometimes to mean “unknown value”, we still don’t use IS NULL in our queries, so indexes should work when needed.

5) Unique constraints don’t work with NULL

Most queries on date ranges, including mine, assume that start dates are unique. In some cases, it may also be useful to have unique end dates. The only way to make sure they are unique is to add a unique constraint. Unfortunately, uniqueness is only enforced for NOT NULL values.

Most of the time, your constraint will be on two columns: some object and a date. In this case the unique constraint will work as long as the object column is NOT NULL. However, anytime you need unique dates throughout a table you must define them as NOT NULL and use default values.

[Update] As mentioned under paragraph 3), this point is only valid for unique constraints on the date alone. This is probably a rare case. Also, as Kevan reminded me, you could always use a function-based index on (DATE_COL, ‘X’) to make sure every row was indexed.
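Here is a sketch of Kevan’s trick on a table T with a nullable END_DATE: because the second column is a constant, no index entry is ever entirely NULL, so every row is indexed.

create index t_end_date_ix on t (end_date, 'X');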

[Update] Conclusion: NULL should mean “unknown” only

As readers have pointed out, my arguments 3) and 5) are weak, since they only apply to indexes or constraints on a date column alone, and those are rarely needed.

I’ll stand by arguments 1), 2) and 4), but thanks to Kevan I’ll add one thing: just because I don’t use NULL to mean “beginning or end of time”, that doesn’t mean I might not need it to mean “unknown value”. In that case, default values may not be the way to go.

Finally, please read the replies: good stuff, and thanks to all!

Overlapping ranges with priority

A few years ago, Alberto Dell’Era blogged about product prices with overlapping date ranges; “on an overlapping range, the strongest priority (lower number) wins.” His analysis, explanation and solution are all excellent. I’m revisiting the subject to suggest a few improvements (and mention MATCH_RECOGNIZE yet again).

To explain more easily, Alberto uses letters (a,b,c,d,e) instead of dates, and in the code he changes the letters to numbers using the ascii() function. (The logic is the same, whether the ranges are dates or something else.) He uses the term sku as a product identifier. “Prices of skus (i.e. goods) have validity ranges and can overlap; on an overlapping range, the strongest priority (lower number) wins.” Sample input:

b______(0,$200)_______d
b__________c______(1,$300)_______e

Notice that d and e are not included in the ranges: the end points are exclusive.

b_______($200)________d
b__________c__________d__($300)__e
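The RANGES table itself is not shown in this excerpt, so as a working assumption, here is a reconstruction matching the sample input above (letters stored as numbers via ascii(), as in Alberto’s code):

create table ranges (sku, a, b, prio, price) as
select 'SKU1', ascii('b'), ascii('d'), 0, 200 from dual union all
select 'SKU1', ascii('c'), ascii('e'), 1, 300 from dual;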

The Algorithm

Step                                             Expected Output
1) Get distinct start and end points             b, c, d, e
2) Create non-overlapping base ranges            b-c, c-d, d-e, e-(null)
   (Alberto has a not null end point, but in
   general nulls may be useful.)
3) Assign price/priority to base ranges          b-c (0,$200); c-d (0,$200); c-d (1,$300); d-e (1,$300)
4) For each base range, keep highest priority    b-c ($200); c-d ($200); d-e ($300)
5) Merge consecutive ranges with same price      b-d ($200); d-e ($300)

1) Get distinct start and end points

Alberto uses UNION to get the start points, then the end points.

with instants as (
  select sku, a as i from ranges
  union
  select sku, b as i from ranges
),

This requires two scans of the index. I suggest using UNPIVOT, which does the work with one scan. I’ll show you that next.

2) Create non-overlapping base ranges

Alberto does this with the analytic LEAD function:

base_ranges as (
  select * from (
    select sku,
    i as ba,
    lead(i) over (partition by sku order by i) as bb
    from instants
  )
  where bb is not null
),

I do the same thing, but thanks to UNPIVOT I can do steps 1) and 2) with one SELECT:

with base_ranges as (
  select distinct sku, i ba,
  lead(i) over(partition by sku order by i, col) bb,
  dense_rank() over(partition by sku order by i) seq
  from ranges
  unpivot(i for col in(a,b))
),

Notice I add the column SEQ, which assigns consecutive integers to each date range. SEQ will allow me to use the Tabibitosan method in step 5)!
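As a reminder, here is the Tabibitosan idea in a minimal sketch with made-up data: within an unbroken run of equal values, an overall sequence and a per-value ROW_NUMBER() advance in step, so their difference is constant and identifies the group.

with demo (seq, price) as (
  select 1, 200 from dual union all
  select 2, 200 from dual union all
  select 3, 300 from dual union all
  select 4, 200 from dual
)
select seq, price,
  -- constant within each consecutive run of the same price
  seq - row_number() over(partition by price order by seq) grp
from demo;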

3) Assign price/priority to base ranges

We both do this by joining the base ranges to the original rows. Here’s Alberto’s code:

factored_ranges as (
  select i.sku, bi.ba, bi.bb, i.a, i.b, i.prio, i.price
  from ranges i, base_ranges bi
  where i.sku = bi.sku
  and (i.a <= bi.ba and bi.ba < i.b)  
),

4) For each base range, keep highest priority

Alberto uses the DENSE_RANK analytic function on each base range, giving a rank of 1 to the highest priority, then keeps only the rows where the rank is 1.

strongest_factored_ranges as (
  select sku, ba, bb, prio, price
  from (
    select sku, ba, bb, prio, price,
    dense_rank () over (partition by sku, ba, bb order by prio) as rnk
    from factored_ranges
  )
  where rnk = 1
),

I decided to use the KEEP (DENSE_RANK FIRST...) aggregate function. I can then combine the aggregation with another analytic function in the same SELECT. I’ll show this in a minute.

5) Merge consecutive ranges with same price

Alberto uses an equivalent to the “start of group” method to merge the prices.

-- STEP will be zero if a range can be joined to the previous one, since:
-- a) they are consecutive (no gap between them)
-- b) they have the same price
-- see http://www.oracle.com/technetwork/issue-archive/o24asktom-095715.html
ranges_with_step as (
  select sku, ba, bb, prio, price,
  decode (
    price,
    lag(price) over (partition by sku order by ba),
    ba - lag(bb) over (partition by sku order by ba),
    1000
  ) step
  from strongest_factored_ranges
), 
-- the integral of step over the ranges
-- joinable ranges will hence have the same value for "interval" since step is zero
ranges_with_step_integral as (
  select sku, ba, bb, prio, price, step,
  sum(step) over (
    partition by sku order by ba
    rows between unbounded preceding and current row
  ) as integral
  from ranges_with_step
), 
-- this joins the joinable ranges
ranges_joined as (
  select * from (
    select sku, ba, bb, prio, price, step, integral,
    min(ba) over (partition by sku, integral) as a,
    max(bb) over (partition by sku, integral) as b
    from ranges_with_step_integral
  )
  where step > 0 
)
select sku, chr(a) "START", chr(b) "END", price
from ranges_joined;

I’ve waited until now to show my code, because I do steps 3), 4) and part of 5) in the same SELECT.

strongest_factored_ranges as (
  select bi.sku,
  bi.seq -
    row_number() over(partition by bi.sku order by min(i.prio), bi.seq) grp,
  bi.ba a, bi.bb b, 
  min(i.price) keep (dense_rank first order by i.prio) price
  from ranges i, base_ranges bi
  where i.sku = bi.sku
  and (i.a <= bi.ba
  and (bi.bb <= i.b or bi.bb is null and i.b is null))
  group by bi.sku, bi.seq, bi.ba, bi.bb
)
select sku, chr(min(a)) "START", chr(max(b)) "END", price
from strongest_factored_ranges
group by sku, grp, price;
  • I changed the WHERE clause to allow for NULL end points, just to show how.
  • The KEEP (DENSE_RANK FIRST...) aggregate function gives me the price of the highest priority for each base range.
  • Once the aggregation is done, I can use an analytic function, and even refer to the result of the aggregation within the analytic. Here is where I use the Tabibitosan method to assign the same grp to consecutive ranges with the same price.

Using 12c MATCH_RECOGNIZE

With 12c row pattern matching, I don’t need the Tabibitosan method so I can simplify the preceding subqueries and use fewer analytic functions. Here is the entire solution:

-- steps 1) and 2)
with base_ranges as (
  select distinct sku, i ba,
  lead(i) over(partition by sku order by i, col) bb
  from ranges
  unpivot(i for col in(a,b))
)
-- steps 3) and 4)
, strongest_factored_ranges as (
  select bi.sku,
  bi.ba a, bi.bb b, 
  min(i.price) keep (dense_rank first order by i.prio) price
  from ranges i, base_ranges bi
  where i.sku = bi.sku
  and (i.a <= bi.ba
  and (bi.bb <= i.b or bi.bb is null and i.b is null))
  group by bi.sku, bi.ba, bi.bb
)
-- step 5)
select * from strongest_factored_ranges
match_recognize(
  partition by sku order by a, b
  measures chr(first(a)) "START", chr(last(b)) "END",
    first(price) price
  pattern(A B*)
  define B as (a, price) = ((prev(b), a.price))
);

Performance

You may be a bit disappointed in my performance findings: I certainly am. The fact is, using UNPIVOT instead of UNION in step 1) is clearly better, because it does one index scan instead of two. Other than that, Alberto’s solution and my pre-12c “improvements” have equivalent performance and use about the same resources. On the other hand, MATCH_RECOGNIZE is a little slower because it uses more CPU. In conclusion, in version 11.2 I would keep Alberto’s solution (except for that first UNION) because it is easier to maintain, and I would use MATCH_RECOGNIZE in version 12c for the same reason (unless the bit of extra CPU is a problem).