I forgot to add a MATCH_RECOGNIZE solution to my last post on merging overlapping date ranges! That should take me just a few minutes, right? Wrong: it’s not that easy and here’s why.
UPDATE 2018-06-24: “mathguy”, a top contributor to OTN/ODC, pointed out that the “workaround” I originally published was not correct. My thanks to him and my apologies to my readers! Better solution below…
For test data, please refer to the previous post.
To Merge or not to Merge?
The idea is to merge date ranges if they meet or overlap either partially or completely – in other words, we keep merging as long as there is not a gap. How do we know whether or not a gap exists?
- We sort by start date, then end date.
- The current row should be merged if its start date is less than or equal to
the most recent end date, up to but not including the current row.
- Again, the secret is to compare the most recent end date, not just the end date on the previous line.
The Analytics solution
With analytics, the “windowing clause” lets us do exactly what we want: when we say
rows between unbounded preceding and 1 preceding then Oracle will apply the
max() function to all the preceding rows, up to but not including the current row. We simply compare the current start date with that maximum preceding end date and we know whether or not to merge.
The MATCH_RECOGNIZE difficulty
MATCH_RECOGNIZE, available starting in version 12c, is pretty powerful and does things analytics can’t do, but it doesn’t really have a windowing clause. When doing comparisons, this can be a problem.
Let’s try a simple test:
select * from t match_recognize( partition by test_case order by start_date, end_date measures match_number() match, max(end_date) max_date all rows per match pattern(a+) define a as start_date <= max(a.end_date) ) where rownum <= 2;
Here I'm comparing the start_date with the maximum date already assigned to A – no I’m not! That is not the way the
DEFINE clause works. It assumes the current row belongs to A, then does the test based on that assumption. As a result, the end date from the current row is included in the computation. Look at the result:
The “match number” is the same, which means the second line will be merged with the first – and all because the start date is being compared to the wrong end date.
The Preferred Solution
I can’t compare the current start date with the maximum preceding end date – but I can compare the current maximum end date with the next start date! As long as that comparison is true, both the current row and the next row should be part of the match.
select * from t match_recognize( partition by test_case order by start_date, end_date measures first(start_date) start_date, max(end_date) end_date pattern(a* b) define a as max(end_date) >= next(start_date) );
There are two special cases to consider:
acomparison fails at the first row: in that case, the match should contain the first row only since it meets the implicit
b, which is always true. The
pattern(a* b)will ensure that.
- There is only one row left, so
next(start_date)will be null. The
acomparison will fail but that row will match the
bcondition, which is what we want.
The Workaround – NOT
UPDATE 2018-06-24: this solution is not correct. Please see the comment by “mathguy” for an explanation.
MATCH_RECOGNIZE does give you access to other rows than the current row, using
PREV, NEXT, FIRST and
LAST. In this case, I could say
LAST(end_date,1) and get the end date from the preceding matched line – or I could just say
PREV(end_date) for the same result.
So how can I be sure that the preceding end date is the most recent?
MATCH_RECOGNIZE doesn’t let you use analytic functions around
PREV. I’ll just have to change the
ORDER BY and sort by end date first. When I do, it finally works:
select * from t match_recognize( partition by test_case order by end_date, start_date measures min(start_date) start_date, max(end_date) end_date pattern(a b*) define b as start_date <= last(end_date,1) );