Merging contiguous date ranges

Last time I wrote about finding gaps in date ranges: this post is about merging date ranges that “meet”. This is a frequent question; the answer applies to any ranges, not just dates. As a reminder, I consider ranges to “meet” when the end value of one record is equal to the start value of the following record.

[Update 2014-07-04] Warning: none of these solutions work when the test data contains NULL values. I have added a new section to talk about NULLs.

Some test data, using date ranges:

drop table t purge;
create table t ( id int, start_date date, end_date date );
Insert into T values (1, DATE '2014-01-01', DATE '2014-01-03');
Insert into T values (2, DATE '2014-01-03', DATE '2014-01-05');
Insert into T values (3, DATE '2014-01-05', DATE '2014-01-07');
Insert into T values (4, DATE '2014-01-08', DATE '2014-02-01');
Insert into T values (5, DATE '2014-02-01', DATE '2014-02-10');
Insert into T values (6, DATE '2014-02-05', DATE '2014-02-28');
Insert into T values (7, DATE '2014-02-10', DATE '2014-02-15');
ID START_DATE END_DATE
1 2014-01-01 00:00:00 2014-01-03 00:00:00
2 2014-01-03 00:00:00 2014-01-05 00:00:00
3 2014-01-05 00:00:00 2014-01-07 00:00:00
4 2014-01-08 00:00:00 2014-02-01 00:00:00
5 2014-02-01 00:00:00 2014-02-10 00:00:00
6 2014-02-05 00:00:00 2014-02-28 00:00:00
7 2014-02-10 00:00:00 2014-02-15 00:00:00

I expect columns 1 through 3 to be merged, as well as columns 4 and 5. The requirement says to merge contiguous ranges, not overlapping ranges, so columns 6 and 7 should stay as they are.

The “start of group” method

I wrote about the “start of group” method a week ago.

  • The first step is to assign 1 to each row that starts a group and 0 to every other row. Here the test is simple: a group starts when the current start date is not equal to the previous end date.
  • The next step is to assign a group to each row using a running sum of the 1s and 0s.
  • Finally, I do the group by.
with grp_starts as (
  select start_date, end_date,
  case
    when start_date = lag(end_date) over(order by start_date, end_date)
    then 0 else 1
  end grp_start
  from t
)
, grps as (
  select start_date, end_date,
  sum(grp_start) over(order by start_date, end_date) grp
  from grp_starts
)
select min(start_date) start_date,
max(end_date) end_date
from grps
group by grp
order by 1, 2;
START_DATE END_DATE
2014-01-01 00:00:00 2014-01-07 00:00:00
2014-01-08 00:00:00 2014-02-10 00:00:00
2014-02-05 00:00:00 2014-02-28 00:00:00
2014-02-10 00:00:00 2014-02-15 00:00:00

Using 12c row pattern matching

Here the “pattern” is: any one row (A), followed by zero or more rows (B*) whose start date is equal to the previous end date. By default:

  • the next match starts with the row just after the prior match;
  • only one row per match is returned;
  • since A is not defined, the implicit condition is TRUE: the row always matches.
select * from t
match_recognize(
  order by start_date, end_date
  measures first(start_date) start_date, last(end_date) end_date
  pattern(A B*)
  define B as start_date = prev(end_date)
);

Merging overlapping rows

Suppose the requirement is to merge rows that overlap as well as rows that “meet”. With the “start of group” method, I change my first test from “=” to “<=” and I’m done. With MATCH_RECOGNIZE, I change DEFINE B from “=” to “<=” and I return max(end_date) instead of last(end_date).

select * from t
match_recognize(
  order by start_date, end_date
  measures first(start_date) start_date, max(end_date) end_date
  pattern(A B*)
  define B as start_date <= prev(end_date)
);
START_DATE END_DATE
2014-01-01 00:00:00 2014-01-07 00:00:00
2014-01-08 00:00:00 2014-02-28 00:00:00

[Update 2014-07-02: handling NULL values]

You may want to allow NULL values for START_DATE (for the indefinite past) or END_DATE (for the indefinite future). If you do, all these solutions break because functions like MAX() leave NULLs out. The workarounds are complicated and (I suspect) less efficient.

I think the simplest and surest way to fix this is to substitute real dates for the null values. The earliest possible date in Oracle SQL is 4712-01-01 B.C. and the latest is 9999-12-31. I use COALESCE to change nulls to these dates, and NULLIF to change these dates back to nulls for display. Here are simple test code and adjusted solutions. Every statement will return one line with START_DATE as NULL and END_DATE as NULL.

drop table t cascade constraints purge;
create table t ( id int, start_date date, end_date date );
Insert into T values (1, null, DATE '2014-01-03');
Insert into T values (2, DATE '2014-01-03', DATE '2014-01-05');
Insert into T values (3, DATE '2014-01-05', null);
commit;

with data as (
  select id,
  coalesce(start_date, date '-4712-01-01') start_date,
  coalesce(end_date, date '9999-12-31') end_date
  from t
)
, grp_starts as (
  select start_date, end_date,
  case
    when start_date = lag(end_date) over(order by start_date, end_date)
    then 0 else 1
  end grp_start
  from data
)
, grps as (
  select start_date, end_date,
  sum(grp_start) over(order by start_date, end_date) grp
  from grp_starts
)
select nullif(min(start_date), date '-4712-01-01') start_date,
nullif(max(end_date), date '9999-12-31') end_date
from grps
group by grp
order by 1 nulls first, 2;

select nullif(start_date, date '-4712-01-01') start_date,
nullif(end_date, date '9999-12-31') end_date
from (
  select id,
  coalesce(start_date, date '-4712-01-01') start_date,
  coalesce(end_date, date '9999-12-31') end_date
  from t
)
match_recognize(
  order by start_date, end_date
  measures first(start_date) start_date, last(end_date) end_date
  pattern(A B*)
  define B as start_date = prev(end_date)
);

with data as (
  select id,
  coalesce(start_date, date '-4712-01-01') start_date,
  coalesce(end_date, date '9999-12-31') end_date
  from t
)
, grp_starts as (
  select start_date, end_date,
  case
    when start_date <= lag(end_date) over(order by start_date, end_date)
    then 0 else 1
  end grp_start
  from data
)
, grps as (
  select start_date, end_date,
  sum(grp_start) over(order by start_date, end_date) grp
  from grp_starts
)
select nullif(min(start_date), date '-4712-01-01') start_date,
nullif(max(end_date), date '9999-12-31') end_date
from grps
group by grp
order by 1 nulls first, 2;

select nullif(start_date, date '-4712-01-01') start_date,
nullif(end_date, date '9999-12-31') end_date
from (
  select id,
  coalesce(start_date, date '-4712-01-01') start_date,
  coalesce(end_date, date '9999-12-31') end_date
  from t
)
match_recognize(
  order by start_date, end_date
  measures first(start_date) start_date, max(end_date) end_date
  pattern(A B*)
  define B as start_date <= prev(end_date)
);

[Update 2014-03-18] Using the Tabibitosan method

Update 2: Warning! Frank’s method doesn’t work when there are both gaps and overlaps in the date ranges. I don’t recommend it.

Just today on the OTN forum, Frank Kulash taught me how to apply this method, which he calls “Fixed Difference”, to date ranges. I think his solution is more efficient, since it does fewer window sorts, but it only works with number and date ranges, not timestamps — and it doesn’t work when there are both gaps and overlaps in the date ranges.

First, here is the subquery that calculates the groups:

select start_date, end_date,
end_date -
  sum(end_date - start_date) over(order by start_date, end_date)
as grp
from t
START_DATE END_DATE GRP
2014-01-01 2014-01-03 2014-01-01
2014-01-03 2014-01-05 2014-01-01
2014-01-05 2014-01-07 2014-01-01
2014-01-08 2014-02-01 2014-01-02
2014-02-01 2014-02-10 2014-01-02
2014-02-05 2014-02-28 2013-12-28
2014-02-10 2014-02-15 2013-12-10

The first “grp” is equal to start_date, since we add end_date and subtract it. In the second row, we are using the second end_date instead of the first, which is like adding the difference between the two end_dates. At the same time, we are subtracting the difference between the second start_date and the second end_date. So, if the second start_date is the same as the first end_date, the result will be the same: the “grp” will be the first start_date of the contiguous range.

After that, we just need to “group by”.

with grps as (
  select start_date, end_date,
  end_date -
    sum(end_date - start_date) over(order by start_date, end_date)
  as grp
  from t
)
select min(start_date) start_date,
max(end_date) end_date
from grps
group by grp
order by 1, 2;

Unfortunately, when there are both gaps and overlaps, they can cancel each other out and a row may be considered part of a contiguous series when in fact it is not. For an example, see the forum thread.

About these ads

One thought on “Merging contiguous date ranges

  1. Thank you SO much! I have been toiling for days trying to crack this nut! Works perfectly!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s