Merging Overlapping Date Ranges with MATCH_RECOGNIZE

I forgot to add a MATCH_RECOGNIZE solution to my last post on merging overlapping date ranges! That should take me just a few minutes, right? Wrong: it’s not that easy and here’s why.

UPDATE 2018-06-24: “mathguy”, a top contributor to OTN/ODC, pointed out that the “workaround” I originally published was not correct. My thanks to him and my apologies to my readers! Better solution below…

For test data, please refer to the previous post.

To Merge or not to Merge?

The idea is to merge date ranges if they meet or overlap either partially or completely – in other words, we keep merging as long as there is not a gap. How do we know whether or not a gap exists?

  • We sort by start date, then end date.
  • The current row should be merged if its start date is less than or equal to
    the most recent end date, up to but not including the current row.
  • Again, the secret is to compare the most recent end date, not just the end date on the previous line.

The Analytics solution

With analytics, the “windowing clause” lets us do exactly what we want: when we say rows between unbounded preceding and 1 preceding then Oracle will apply the max() function to all the preceding rows, up to but not including the current row. We simply compare the current start date with that maximum preceding end date and we know whether or not to merge.

The MATCH_RECOGNIZE difficulty

MATCH_RECOGNIZE, available starting in version 12c, is pretty powerful and does things analytics can’t do, but it doesn’t really have a windowing clause. When doing comparisons, this can be a problem.

Let’s try a simple test:

select * from t
  partition by test_case
  order by start_date, end_date
  measures match_number() match, max(end_date) max_date
  all rows per match
  define a as start_date <= max(a.end_date)
where rownum <= 2;

Here I'm comparing the start_date with the maximum date already assigned to A – no I’m not! That is not the way the DEFINE clause works. It assumes the current row belongs to A, then does the test based on that assumption. As a result, the end date from the current row is included in the computation. Look at the result:

01:precedes 2015-06-01 2015-06-02 1 2015-06-02
01:precedes 2015-06-03 2015-06-04 1 2015-06-04

The “match number” is the same, which means the second line will be merged with the first – and all because the start date is being compared to the wrong end date.

The Preferred Solution

I can’t compare the current start date with the maximum preceding end date – but I can compare the current maximum end date with the next start date! As long as that comparison is true, both the current row and the next row should be part of the match.

select * from t
  partition by test_case
  order by start_date, end_date
  measures first(start_date) start_date, max(end_date) end_date
  pattern(a* b)
  define a as max(end_date) >= next(start_date)

There are two special cases to consider:

  1. The a comparison fails at the first row: in that case, the match should contain the first row only since it meets the implicit b, which is always true. The pattern(a* b) will ensure that.
  2. There is only one row left, so next(start_date) will be null. The a comparison will fail but that row will match the b condition, which is what we want.

The Workaround – NOT

UPDATE 2018-06-24: this solution is not correct. Please see the comment by “mathguy” for an explanation.

MATCH_RECOGNIZE does give you access to other rows than the current row, using PREV, NEXT, FIRST and LAST. In this case, I could say LAST(end_date,1) and get the end date from the preceding matched line – or I could just say PREV(end_date) for the same result.

So how can I be sure that the preceding end date is the most recent? MATCH_RECOGNIZE doesn’t let you use analytic functions around LAST or PREV. I’ll just have to change the ORDER BY and sort by end date first. When I do, it finally works:

select * from t
  partition by test_case
  order by end_date, start_date
  measures min(start_date) start_date, max(end_date) end_date
  pattern(a b*)
  define b as start_date <= last(end_date,1)
01:precedes 2015-06-01 2015-06-02
01:precedes 2015-06-03 2015-06-04
02:meets 2015-06-01 2015-06-03
03:overlaps 2015-06-01 2015-06-04
04:finished by 2015-06-01 2015-06-03
05:contains 2015-06-01 2015-06-04
06:starts 2015-06-01 2015-06-03
07:equals 2015-06-01 2015-06-02
08:started by 2015-06-01 2015-06-03
09:during 2015-06-01 2015-06-04
10:finishes 2015-06-01 2015-06-03
11:overlapped by 2015-06-01 2015-06-04
12:met by 2015-06-01 2015-06-03
13:preceded by 2015-06-01 2015-06-02
13:preceded by 2015-06-03 2015-06-04


6 thoughts on “Merging Overlapping Date Ranges with MATCH_RECOGNIZE

  1. Pingback: Log Buffer #427: A Carnival of the Vanities for DBAs | InsideMySQL

  2. Pingback: Log Buffer #427: A Carnival of the Vanities for DBAs | MySQL

  3. Regarding the “workaround” to merging overlapping intervals: I don’t believe the solution you propose is correct. To see why, let’s work with number intervals for simplicity (instead of times), and consider a single partition (a single test case). Let the intervals be [1, 9], [3, 5] and [6, 8]. If you order by (end, start) then the order is [3,5] followed by [6, 8] followed by [1, 9]. The first interval is classified as A. The second is NOT classified as B; instead, it becomes the A of the second match. The query you propose will return two rows, instead of a single row.

    • Hi mathguy,

      You are absolutely right!

      I later thought of an alternative solution that I liked a bit better, but decided not to change this post because I thought both proposals were correct. My logic was faulty here and my test cases were incomplete…

      Thanks very much for pointing this out. I am changing the post.

      Best regards,

  4. Let’s give a try to this one:

    match_recognize (
    partition by your_keys,
    order by start_date, end_date
    measures first(start_date) start_date, max(end_date) end_date
    pattern(B* A)
    define B as max(end_date) >= next(start_date)

    I found it more robust against overlapping intervals and patterns in W for the end_date.

    • Hi Pascal,

      It’s good to see more people getting used to the MATCH_RECOGNIZE clause!

      Your solution seems perfect. It also seems to be the same as the solution I put in the paragraph entitled “The Preferred Solution”. That is my preferred solution.

      Maybe my post is not clear because of the corrections I made.

      Best regards,

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s