MATCH_RECOGNIZE: gaps in date ranges

To find gaps between date ranges, there is an elegant solution using analytic functions. We can follow the same logic with MATCH_RECOGNIZE, using a technique that avoids the problems with MIN() and MAX() comparisons I just wrote about.

Here are some simple test data

create table t ( start_date date, end_date date );
Insert into T values (DATE '2007-01-01', DATE '2007-01-15');
Insert into T values (DATE '2007-01-03', DATE '2007-01-10');
Insert into T values (DATE '2007-01-12', DATE '2007-01-25');
Insert into T values (DATE '2007-01-20', DATE '2007-02-01');
Insert into T values (DATE '2007-02-05', DATE '2007-02-10');
Insert into T values (DATE '2007-02-05', DATE '2007-02-28');
Insert into T values (DATE '2007-02-10', DATE '2007-02-15');
Insert into T values (DATE '2007-03-01', DATE '2007-03-02');
Insert into T values (DATE '2007-03-03', DATE '2007-03-16');

Please note that I use exclusive end dates: the second to last range ends when March 2d starts, so there should be a gap from March 2d to March 3d.

Using analytics, I compare the maximum end date so far to the next start date:

  SELECT MAX(end_date) OVER (ORDER BY start_date) start_gap,
  LEAD(start_date) OVER (ORDER BY start_date) end_gap
WHERE start_gap < end_gap;
2007-02-01 2007-02-05
2007-02-28 2007-03-01
2007-03-02 2007-03-03

That is so simple – once you think of it.

Looking for a MATCH_RECOGNIZE solution, I tried to compare the current start date to the maximum prior end date. I ran into the problem I wrote about previously: when you use MAX() or other aggregates in a comparison, the current row is included.

I finally realized that if I compared the maximum end date to the next start date, the problem disappeared.

select start_gap, end_gap from t
  order by start_date
  measures max(end_date) start_gap, next(start_date) end_gap
  all rows per match
  define A as max(end_date) < next(start_date)

For those less familiar with the syntax

  • Every row that has a gap with the following row gets assigned to “A”, and all the other rows get assigned to “B”.
  • Normally, all rows per match will output every row, but {-b-} excludes the “B” rows from the output.

This code is logically equivalent to the analytic solution above. It also demonstrates how to avoid the knotty problem of how to exclude the current row from aggregates:

  1. Instead of comparing the aggregate to the current row, just back up a row.
  2. Now the aggregate contains exactly the rows you want.
  3. So compare that aggregate to the value of the row you care about, which is the “next” row.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s