Beyond Analytics: MODEL or MATCH_RECOGNIZE

Analytic functions are powerful and efficient, but sometimes they just aren’t enough. When analytics alone don’t solve the problem, it’s time to think about the MODEL clause, or to upgrade to 12c and use MATCH_RECOGNIZE. All three can use partitioning and ordering for simpler, more efficient processing. To illustrate, here’s a question from AskTom that Brendan Furey and I answered three different ways: “Group records that are within 6 hours of the first record in the group”.

The requirement

The input contains several records per ID, each record having a datetime called INS_DATE.

drop table GRTAB purge;
create table GRTAB (id number, INS_DATE date);
--
insert into GRTAB values(1,TO_DATE('2011-05-25 23:13:32','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 02:14:19','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 04:15:30','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 05:14:31','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 07:15:19','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 10:15:50','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 13:44:46','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 15:14:54','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 16:15:01','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 17:14:38','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 19:15:36','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(2,TO_DATE('2011-05-30 11:30:17','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(2,TO_DATE('2011-05-30 14:30:22','YYYY-MM-DD HH24:MI:SS'));

For each ID, the first group starts with the earliest record, whose INS_DATE becomes the BASE_DATE for that group. This group contains all the records that are within 6 hours of BASE_DATE. The next group starts with the earliest record after the previous group, whose INS_DATE becomes the BASE_DATE of that group.

ID INS_DATE BASE_DATE
1 2011-05-25 23:13:32 2011-05-25 23:13:32
1 2011-05-26 02:14:19 2011-05-25 23:13:32
1 2011-05-26 04:15:30 2011-05-25 23:13:32
1 2011-05-26 05:14:31 2011-05-26 05:14:31
1 2011-05-26 07:15:19 2011-05-26 05:14:31
1 2011-05-26 10:15:50 2011-05-26 05:14:31
1 2011-05-26 13:44:46 2011-05-26 13:44:46
1 2011-05-26 15:14:54 2011-05-26 13:44:46
1 2011-05-26 16:15:01 2011-05-26 13:44:46
1 2011-05-26 17:14:38 2011-05-26 13:44:46
1 2011-05-26 19:15:36 2011-05-26 13:44:46
2 2011-05-30 11:30:17 2011-05-30 11:30:17
2 2011-05-30 14:30:22 2011-05-30 11:30:17

Why use analytics? PARTITION and ORDER BY!

If we know and use analytic functions, we naturally think of using them for this kind of problem. Why? Because the first thing we want to do with the input is PARTITION it by ID, and the next thing we want to do is ORDER each partition by INS_DATE. Partitioning makes processing simpler and more efficient: we only have to think about one ID at a time, and Oracle can independently process the records for each ID. Once we have partitioned, we obviously have to order by INS_DATE.
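
For example, a quick sketch: the FIRST_VALUE analytic function, with our partition and order, already produces the BASE_DATE of each ID’s first group, though not of the later groups.

select id, ins_date,
       -- earliest INS_DATE per ID: the BASE_DATE of the first group only
       first_value(ins_date) over(partition by id order by ins_date) base_date
from grtab;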

Why not use analytics?

That is the question. Partitioning and ordering may help solve a problem, but that doesn’t mean analytics are the complete solution. How can we tell when analytics will work and when they won’t?

Analytic functions have input and output: the input is the result of the JOINs, GROUP BYs, etc., and it already contains all the rows of the final result set. An analytic function can use input data from rows before and after the current row, but it cannot access the output from analytics in other rows.

One way around this limitation is to nest analytic functions with subqueries or inline views. The “start of group” method does exactly this: it uses analytics to assign 1 to the first row of a group (and 0 to the others), then it uses the 1s and 0s as input to a SUM() function to generate the group identifiers. Here analytics works because we assign 1 or 0 based only on the input to the function: we don’t need to know whether any previous row got a 1 or a 0.
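
As a minimal sketch of the method (using a hypothetical table EVENTS with columns ID and DT, and a 6-hour threshold), notice that the “start of group” test compares each row only to the previous input row, never to a computed result:

select id, dt,
       sum(grp_start) over(partition by id order by dt) grp
from (
  select id, dt,
         -- 1 marks a row that starts a group, 0 marks a continuation
         case
           when dt - lag(dt, 1, dt - 1) over(partition by id order by dt) > 0.25
           then 1 else 0
         end grp_start
  from events
);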

We cannot solve our current problem with nested analytics because we don’t know how much nesting to do. We would have to nest again every time the BASE_DATE changes!

No, this is a job for what Tom Kyte calls “procedural” processing, even if it can be done in SQL.

Who else does partitioning? The MODEL clause

The MODEL clause, introduced in Oracle 10g, is complex and therefore hard to understand. Still, we can use bits of its functionality to do processing that works like analytics but lets us be “procedural”.

SELECT id, ins_date, base_date
FROM grtab
MODEL
  PARTITION BY (id)
  DIMENSION BY (Row_Number() OVER (PARTITION BY id ORDER BY ins_date) rn)
  MEASURES (ins_date, ins_date base_date)
  RULES (
    base_date[rn > 1] =
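      -- a date difference is measured in days, so 0.25 = 6 hours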
      CASE
        WHEN ins_date[cv()] - base_date[cv()-1] > 0.25 THEN ins_date[cv()] 
        ELSE base_date[cv()-1]
      END
  )
  • Without a detailed explanation, we can see that the MODEL clause allows partitioning.
  • Then, instead of an ORDER BY, we use a DIMENSION that orders the data and assigns a sequential subscript to each row.
  • The MEASURES clause adds a BASE_DATE column that is initially populated with the INS_DATE value.
  • After that, our RULE says to go through every row except the first, and to update the BASE_DATE column. The rule says in effect: if my ins_date is more than 6 hours later than the previous row’s base_date, then assign my ins_date to my base_date; otherwise my base_date is the same as the previous row’s base_date.
    By “my” I mean the current row. The MODEL clause refers to the current row using the subscript cv(). Since the subscripts are sequential integers, we can refer to the previous row using cv()-1.

If you look closely, you’ll see that the rule doesn’t make sense unless it is executed in order, from the second row to the last. This is not a problem, since “the order defaults to the order of the columns as specified in the DIMENSION BY clause.”
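
If you prefer to make that order explicit rather than rely on the default, a rule can carry its own ORDER BY. Here is a sketch; the behavior is the same for this query:

SELECT id, ins_date, base_date
FROM grtab
MODEL
  PARTITION BY (id)
  DIMENSION BY (Row_Number() OVER (PARTITION BY id ORDER BY ins_date) rn)
  MEASURES (ins_date, ins_date base_date)
  RULES (
    -- ORDER BY rn says explicitly: apply this rule row by row, in rn order
    base_date[rn > 1] ORDER BY rn =
      CASE
        WHEN ins_date[cv()] - base_date[cv()-1] > 0.25 THEN ins_date[cv()]
        ELSE base_date[cv()-1]
      END
  );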

The key advantage to the MODEL clause over analytics is that it orders not only the data, but the processing. Calculations are done in order, and they can use the results of prior calculations.

12c MATCH_RECOGNIZE partitions and orders by

Once we have Oracle 12c, we can use the MATCH_RECOGNIZE clause to solve problems that require nested analytics, or that analytics can’t solve at all.

SELECT * FROM grtab
MATCH_RECOGNIZE (
  PARTITION BY ID
  ORDER BY ins_date
  MEASURES FIRST(ins_date) base_date
  ALL ROWS PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (A+)
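  -- FIRST(ins_date) is the first row of the current match; 0.25 day = 6 hours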
  DEFINE A AS ins_date < FIRST(ins_date) + 0.25
)

Here we have our good old PARTITION and ORDER clauses, but unlike analytics they apply to the entire row and not just to one column. The PATTERN and DEFINE clauses unite in one “match” the first row and all following rows that are within 6 hours of the first row.

Then the “magic” happens: when the first “match” is done, the next match starts with the row right after the current match. This is what “after match skip past last row” means. I could have omitted the line because it is the default behavior, but it’s important enough to say explicitly. This instruction lets us “start over” after every match, based on the result of the previous match.

MATCH_RECOGNIZE is a kind of compromise between analytics and procedural processing: we can order the data like analytics, but we can also order the processing and (to some extent) base our logic on prior processing. However, the order of processing is the same as the order of data: we can’t go backwards, skip around or do loops. This type of processing is neither purely “set-based” nor fully “procedural”; I call it “stream-based” processing.

Performance

The execution plans for MODEL and MATCH_RECOGNIZE are pretty cryptic:

-------------------------------------
| Id  | Operation           | Name  |
-------------------------------------
|   0 | SELECT STATEMENT    |       |
|   1 |  SQL MODEL ORDERED  |       |
|   2 |   WINDOW SORT       |       |
|   3 |    TABLE ACCESS FULL| GRTAB |
-------------------------------------
-----------------------------------------------------------------
| Id  | Operation                                       | Name  |
-----------------------------------------------------------------
|   0 | SELECT STATEMENT                                |       |
|   1 |  VIEW                                           |       |
|   2 |   MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO|       |
|   3 |    TABLE ACCESS FULL                            | GRTAB |
-----------------------------------------------------------------

We know that the table was only accessed once, so the I/O was efficient, but we don’t have many details on memory and CPU usage. Using Tom Kyte’s RUNSTATS package on a table with 212992 rows, I found that MATCH_RECOGNIZE ran 20% faster, used 20% less CPU and 40% less memory than MODEL.
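
For reference, here is roughly how such a comparison is run with RUNSTATS (a sketch, assuming the package is installed under the name runstats_pkg; the queries themselves are the ones shown above):

exec runstats_pkg.rs_start;
-- run the MODEL version of the query
exec runstats_pkg.rs_middle;
-- run the MATCH_RECOGNIZE version of the query
exec runstats_pkg.rs_stop(1000); -- show statistics that differ by more than 1000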

Conclusion

Analytic functions let us work with data in other rows than the current row, but we cannot access the results of analytic functions in other rows. We can use the MODEL clause when analytics are not enough: the key is to use ROW_NUMBER() in the DIMENSION clause to assign sequential subscripts to the data and to order the processing. The MATCH_RECOGNIZE clause orders both data and processing and lets us use the results of prior processing, especially to start a new “match” based on where the previous match ended.

Down with Firefox

Warning: this post is not technical and it is not about Oracle.

Brendan Eich recently resigned, under pressure, from his position as CEO of Mozilla, the makers of Firefox. The reason given was a campaign contribution that Mr. Eich made in 2008.

The State of California has the “referendum”: a proposition is submitted to a yes or no vote, and if accepted it becomes law. This is pure democracy.

In 2008, Mr. Eich made a contribution to a group that opposed a proposition. In 2014, a group favorable to the proposition started a boycott of Firefox when Mr. Eich was promoted to CEO. Mozilla succumbed to pressure and asked Mr. Eich to resign, which he finally agreed to do.

It doesn’t matter what the proposition was about. This was an exercise in democracy and Mr. Eich was participating in the debate as a private individual (although he named his employer as required by law). If he had contributed to the opposite side, would he have been asked to leave? (Laughter)

Mr. Eich lost his job because he exercised his individual liberty on the “wrong side”. In protest, I am personally boycotting Firefox. This post and all further posts will be written using Google Chrome.

If you think that freedom of democratic debate should be limited to those who agree with you, then feel free to boycott this blog.

Overlapping ranges with priority

A few years ago, Alberto Dell’Era blogged about product prices with overlapping date ranges; “on an overlapping range, the strongest priority (lower number) wins.” His analysis, explanation and solution are all excellent. I’m revisiting the subject to suggest a few improvements (and mention MATCH_RECOGNIZE yet again).

To explain more easily, Alberto uses letters (a,b,c,d,e) instead of dates, and in the code he changes the letters to numbers using the ascii() function. (The logic is the same, whether the ranges are dates or something else.) He uses the term sku as a product identifier. “Prices of skus (i.e. goods) have validity ranges and can overlap; on an overlapping range, the strongest priority (lower number) wins.” Sample input:

b______(0,$200)_______d
b__________c______(1,$300)_______e

Notice that d and e are not highlighted, because they are not included in their ranges: the end point of each range is exclusive. Since priority 0 is stronger than 1, the expected output is:

b_______($200)________d
b__________c__________d__($300)__e

The Algorithm

1) Get distinct start and end points:
b,c,d,e

2) Create non-overlapping base ranges (Alberto has a non-null end point, but in general NULLs may be useful):
b__________c
b__________c__________d
b__________c__________d__________e
b__________c__________d__________e_(null)

3) Assign price/priority to base ranges:
b_(0,$200)_c
b__________c_(0,$200)_d
b__________c_(1,$300)_d
b__________c__________d_(1,$300)_e

4) For each base range, keep the highest priority:
b__($200)__c
b__________c__($200)__d
b__________c__________d__($300)__e

5) Merge consecutive ranges with the same price:
b_______($200)________d
b__________c__________d__($300)__e

1) Get distinct start and end points

Alberto uses UNION to get the start points, then the end points.

with instants as (
  select sku, a as i from ranges
  union
  select sku, b as i from ranges
),

This requires two scans of the index. I suggest using UNPIVOT, which does the work with one scan. I’ll show you that next.

2) Create non-overlapping base ranges

Alberto does this with the analytic LEAD function:

base_ranges as (
  select * from (
    select sku,
    i as ba,
    lead(i) over (partition by sku order by i) as bb
    from instants
  )
  where bb is not null
),

I do the same thing, but thanks to UNPIVOT I can do steps 1) and 2) with one SELECT:

with base_ranges as (
  select distinct sku, i ba,
  lead(i) over(partition by sku order by i, col) bb,
  dense_rank() over(partition by sku order by i) seq
  from ranges
  unpivot(i for col in(a,b))
),

Notice I add the column SEQ, which assigns consecutive integers to the base ranges. SEQ will allow me to use the Tabibitosan method in step 5!

3) Assign price/priority to base ranges

We both do this by joining the base ranges to the original rows. Here’s Alberto’s code:

factored_ranges as (
  select i.sku, bi.ba, bi.bb, i.a, i.b, i.prio, i.price
  from ranges i, base_ranges bi
  where i.sku = bi.sku
  and (i.a <= bi.ba and bi.ba < i.b)  
),

4) For each base range, keep highest priority

Alberto uses the DENSE_RANK analytic function on each base range, giving a rank of 1 to the highest priority, then keeps only the rows where the rank is 1.

strongest_factored_ranges as (
  select sku, ba, bb, prio, price
  from (
    select sku, ba, bb, prio, price,
    dense_rank () over (partition by sku, ba, bb order by prio) as rnk
    from factored_ranges
  )
  where rnk = 1
),

I decided to use the KEEP (DENSE_RANK FIRST...) aggregate function. I can then combine the aggregation with another analytic function in the same SELECT. I’ll show this in a minute.

5) Merge consecutive ranges with same price

Alberto uses an equivalent to the “start of group” method to merge the prices.

-- STEP will be zero if a range can be joined to the previous one, since:
-- a) they are consecutive (no gap between them)
-- b) they have the same price
-- see http://www.oracle.com/technetwork/issue-archive/o24asktom-095715.html
ranges_with_step as (
  select sku, ba, bb, prio, price,
  decode (
    price,
    lag(price) over (partition by sku order by ba),
    ba - lag(bb) over (partition by sku order by ba),
    1000
  ) step
  from strongest_factored_ranges
), 
-- the integral of step over the ranges
-- joinable ranges will hence have the same value for "interval" since step is zero
ranges_with_step_integral as (
  select sku, ba, bb, prio, price, step,
  sum(step) over (
    partition by sku order by ba
    rows between unbounded preceding and current row
  ) as integral
  from ranges_with_step
), 
-- this joins the joinable ranges
ranges_joined as (
  select * from (
    select sku, ba, bb, prio, price, step, integral,
    min(ba) over (partition by sku, integral) as a,
    max(bb) over (partition by sku, integral) as b
    from ranges_with_step_integral
  )
  where step > 0 
)
select sku, chr(a) "START", chr(b) "END", price
from ranges_joined;

I’ve waited until now to show my code, because I do steps 3), 4) and part of 5) in the same SELECT.

strongest_factored_ranges as (
  select bi.sku,
  bi.seq -
    row_number() over(partition by bi.sku order by min(i.prio), bi.seq) grp,
  bi.ba a, bi.bb b, 
  min(i.price) keep (dense_rank first order by i.prio) price
  from ranges i, base_ranges bi
  where i.sku = bi.sku
  and (i.a <= bi.ba
  and (bi.bb <= i.b or bi.bb is null and i.b is null))
  group by bi.sku, bi.seq, bi.ba, bi.bb
)
select sku, chr(min(a)) "START", chr(max(b)) "END", price
from strongest_factored_ranges
group by sku, grp, price;
  • I changed the WHERE clause to allow for NULL end points, just to show how.
  • The KEEP (DENSE_RANK FIRST...) aggregate function gives me the price of the highest priority for each base range.
  • Once the aggregation is done, I can use an analytic function, and even refer to the result of the aggregation within the analytic. Here is where I use the Tabibitosan method to assign the same grp to consecutive ranges with the same price.

Using 12c MATCH_RECOGNIZE

With 12c row pattern matching, I don’t need the Tabibitosan method so I can simplify the preceding subqueries and use fewer analytic functions. Here is the entire solution:

-- steps 1) and 2)
with base_ranges as (
  select distinct sku, i ba,
  lead(i) over(partition by sku order by i, col) bb
  from ranges
  unpivot(i for col in(a,b))
)
-- steps 3) and 4)
, strongest_factored_ranges as (
  select bi.sku,
  bi.ba a, bi.bb b, 
  min(i.price) keep (dense_rank first order by i.prio) price
  from ranges i, base_ranges bi
  where i.sku = bi.sku
  and (i.a <= bi.ba
  and (bi.bb <= i.b or bi.bb is null and i.b is null))
  group by bi.sku, bi.ba, bi.bb
)
-- step 5)
select * from strongest_factored_ranges
match_recognize(
  partition by sku order by a, b
  measures chr(first(a)) "START", chr(last(b)) "END",
    first(price) price
  pattern(A B*)
  define B as (a, price) = ((prev(b), a.price))
);

Performance

You may be a bit disappointed in my performance findings: I certainly am. The fact is, using UNPIVOT instead of UNION in step 1) is clearly better, because it does one index scan instead of two. Other than that, Alberto’s solution and my pre-12c “improvements” have equivalent performance and use about the same resources. On the other hand, MATCH_RECOGNIZE is a little slower because it uses more CPU. In conclusion, in version 11.2 I would keep Alberto’s solution (except for that first UNION) because it is easier to maintain, and I would use MATCH_RECOGNIZE in version 12c for the same reason (unless the bit of extra CPU is a problem).

Merging contiguous date ranges

Last time I wrote about finding gaps in date ranges: this post is about merging date ranges that “meet”. This is a frequent question; the answer applies to any ranges, not just dates. As a reminder, I consider ranges to “meet” when the end value of one record is equal to the start value of the following record.

Some test data, using date ranges:

drop table t;
create table t ( id int, start_date date, end_date date );
Insert into T values (1, DATE '2014-01-01', DATE '2014-01-03');
Insert into T values (2, DATE '2014-01-03', DATE '2014-01-05');
Insert into T values (3, DATE '2014-01-05', DATE '2014-01-07');
Insert into T values (4, DATE '2014-01-08', DATE '2014-02-01');
Insert into T values (5, DATE '2014-02-01', DATE '2014-02-10');
Insert into T values (6, DATE '2014-02-05', DATE '2014-02-28');
Insert into T values (7, DATE '2014-02-10', DATE '2014-02-15');
ID START_DATE END_DATE
1 2014-01-01 00:00:00 2014-01-03 00:00:00
2 2014-01-03 00:00:00 2014-01-05 00:00:00
3 2014-01-05 00:00:00 2014-01-07 00:00:00
4 2014-01-08 00:00:00 2014-02-01 00:00:00
5 2014-02-01 00:00:00 2014-02-10 00:00:00
6 2014-02-05 00:00:00 2014-02-28 00:00:00
7 2014-02-10 00:00:00 2014-02-15 00:00:00

I expect rows 1 through 3 to be merged, as well as rows 4 and 5. The requirement says to merge contiguous ranges, not overlapping ranges, so rows 6 and 7 should stay as they are.

The “start of group” method

I wrote about the “start of group” method a week ago.

  • The first step is to assign 1 to each row that starts a group and 0 to every other row. Here the test is simple: a group starts when the current start date is not equal to the previous end date.
  • The next step is to assign a group to each row using a running sum of the 1s and 0s.
  • Finally, I do the group by.
with grp_starts as (
  select start_date, end_date,
  case
    when start_date = lag(end_date) over(order by start_date, end_date)
    then 0 else 1
  end grp_start
  from t
)
, grps as (
  select start_date, end_date,
  sum(grp_start) over(order by start_date, end_date) grp
  from grp_starts
)
select min(start_date) start_date,
max(end_date) end_date
from grps
group by grp
order by 1, 2;
START_DATE END_DATE
2014-01-01 00:00:00 2014-01-07 00:00:00
2014-01-08 00:00:00 2014-02-10 00:00:00
2014-02-05 00:00:00 2014-02-28 00:00:00
2014-02-10 00:00:00 2014-02-15 00:00:00

Using 12c row pattern matching

Here the “pattern” is: any one row (A), followed by zero or more rows (B*) whose start date is equal to the previous end date. By default:

  • the next match starts with the row just after the prior match;
  • only one row per match is returned;
  • since A is not defined, the implicit condition is TRUE: the row always matches.
select * from t
match_recognize(
  order by start_date, end_date
  measures first(start_date) start_date, last(end_date) end_date
  pattern(A B*)
  define B as start_date = prev(end_date)
);

Merging overlapping rows

Suppose the requirement is to merge rows that overlap as well as rows that “meet”. With the “start of group” method, I change my first test from “=” to “<=” and I’m done. With MATCH_RECOGNIZE, I change DEFINE B from “=” to “<=” and I return max(end_date) instead of last(end_date).
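
Here is the “start of group” version with that one change (a sketch based on the earlier query; note that with fully contained ranges, comparing to a running MAX of end_date would be safer, but it is not needed for this data):

with grp_starts as (
  select start_date, end_date,
  case
    -- "<=" instead of "=": an overlap also continues the group
    when start_date <= lag(end_date) over(order by start_date, end_date)
    then 0 else 1
  end grp_start
  from t
)
, grps as (
  select start_date, end_date,
  sum(grp_start) over(order by start_date, end_date) grp
  from grp_starts
)
select min(start_date) start_date,
max(end_date) end_date
from grps
group by grp
order by 1, 2;

And the MATCH_RECOGNIZE version: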

select * from t
match_recognize(
  order by start_date, end_date
  measures first(start_date) start_date, max(end_date) end_date
  pattern(A B*)
  define B as start_date <= prev(end_date)
);
START_DATE END_DATE
2014-01-01 00:00:00 2014-01-07 00:00:00
2014-01-08 00:00:00 2014-02-28 00:00:00

Update: using the Tabibitosan method

Update 2: Warning! Frank’s method doesn’t work when there are both gaps and overlaps in the date ranges. I don’t recommend it.

Just today on the OTN forum, Frank Kulash taught me how to apply this method, which he calls “Fixed Difference”, to date ranges. I think his solution is more efficient, since it does fewer window sorts, but it only works with number and date ranges, not timestamps — and it doesn’t work when there are both gaps and overlaps in the date ranges.

First, here is the subquery that calculates the groups:

select start_date, end_date,
end_date -
  sum(end_date - start_date) over(order by start_date, end_date)
as grp
from t
START_DATE END_DATE GRP
2014-01-01 2014-01-03 2014-01-01
2014-01-03 2014-01-05 2014-01-01
2014-01-05 2014-01-07 2014-01-01
2014-01-08 2014-02-01 2014-01-02
2014-02-01 2014-02-10 2014-01-02
2014-02-05 2014-02-28 2013-12-28
2014-02-10 2014-02-15 2013-12-10

For the first row, “grp” equals start_date, since we take end_date and subtract (end_date - start_date). In the second row, we use the second end_date instead of the first, which is like adding the difference between the two end_dates; at the same time, we subtract the difference between the second start_date and the second end_date. So, if the second start_date is the same as the first end_date, the result is unchanged: the “grp” remains the first start_date of the contiguous range.
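
A worked example over the first four rows (durations in days) makes the arithmetic concrete:

grp(row 1) = 2014-01-03 - (2)          = 2014-01-01
grp(row 2) = 2014-01-05 - (2+2)        = 2014-01-01
grp(row 3) = 2014-01-07 - (2+2+2)      = 2014-01-01
grp(row 4) = 2014-02-01 - (2+2+2+24)   = 2014-01-02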

After that, we just need to “group by”.

with grps as (
  select start_date, end_date,
  end_date -
    sum(end_date - start_date) over(order by start_date, end_date)
  as grp
  from t
)
select min(start_date) start_date,
max(end_date) end_date
from grps
group by grp
order by 1, 2;

Unfortunately, when there are both gaps and overlaps, they can cancel each other out and a row may be considered part of a contiguous series when in fact it is not. For an example, see the forum thread.

Gaps in Date Ranges: when are you free?

Calculating “free time” in Calendars is a very common task. Did you know it can be done by a SQL statement less than 100 bytes long? This post pays homage to an incredibly neat and concise use of analytic functions.

The Question

“Free time” in a Calendar means the unused time slots between events; in other words, it is the set of gaps between the date ranges of those events. Back in 2007, a reader asked Tom Kyte for a SQL query to find gaps in date ranges. Here is his input data:

create table t ( a int, b date, c date );
Insert into T (A,B,C) values (1, DATE '2007-01-01', DATE '2007-01-15');
Insert into T (A,B,C) values (2, DATE '2007-01-03', DATE '2007-01-10');
Insert into T (A,B,C) values (3, DATE '2007-01-12', DATE '2007-01-25');
Insert into T (A,B,C) values (4, DATE '2007-01-20', DATE '2007-02-01');
Insert into T (A,B,C) values (5, DATE '2007-02-05', DATE '2007-02-10');
Insert into T (A,B,C) values (6, DATE '2007-02-05', DATE '2007-02-28');
Insert into T (A,B,C) values (7, DATE '2007-02-10', DATE '2007-02-15');
Insert into T (A,B,C) values (8, DATE '2007-02-18', DATE '2007-02-23');
Insert into T (A,B,C) values (9, DATE '2007-02-22', DATE '2007-03-16');

And his expected output:

GAP_START GAP_END
2007-02-02 2007-02-04

The Answer

The first thing to notice is that the reader is asking for the wrong answer – or let’s say a different answer than I would expect. The only gap that exists here is between the record ending 2007-02-01 and the record starting 2007-02-05. As I explained in a previous post on date ranges, the end point is really 2007-02-01 00:00:00; I take it to be exclusive, meaning the gap should start at that same moment. In the same way, the gap should end at 2007-02-05 00:00:00, the moment when the next event starts. I would expect this output:

GAP_START GAP_END
2007-02-01 00:00:00 2007-02-05 00:00:00

The person who asked the question probably thinks in terms of dates without times. For him, a range ending on 2007-02-01 includes any time on that date, so his gap would start the next day. I strongly recommend against this approach, because it only works for “pure” dates and not for date/time values, and it is not compatible with the “Temporal Validity” support introduced by Oracle version 12c.

The Algorithm

Before revealing the code, let me explain how I think the author, Antony Boucher, came up with his solution.

Let’s suppose there are no overlaps between records. In that case,

  • The end date of the current record is the start date of the “gap”;
  • The start date of the next record is the end date of the “gap”;
  • If the “gap” has the same start and end dates, we can just eliminate that record.
SELECT * FROM
  (SELECT C D, LEAD(B) OVER (ORDER BY B) E FROM T)
WHERE D < E

So what do we do about the overlaps? Instead of using the end date of the current record, use the latest end date up to and including the current record.

SELECT * FROM
  (SELECT MAX(C) OVER (ORDER BY B) D, LEAD(B) OVER (ORDER BY B) E FROM T)
WHERE D < E
D E
2007-02-01 00:00:00 2007-02-05 00:00:00

And that is Antony’s solution. Its length: 99 bytes!

SQL for date ranges, gaps and overlaps

There are lots of questions and posts on the Web about data with date ranges. It’s harder than it seems to design tables with start and end dates, to avoid overlaps or even to write SQL to find gaps or overlaps. I’m writing this post to get some basic ideas straight and to make myself some rules.

Comparing date ranges

I noticed a detailed analysis of date ranges and their relations here: Allen’s Interval Algebra. I’m going to say “date range” where the article says “interval”, since for Oracle an INTERVAL is just a duration without a defined starting point. (Update: Galo Baldo noticed a mistake in the text I quote here. I underlined the part he corrected.)

“This table shows all the possible relations that two date ranges can have. Each one is defined graphically by a diagram relating two ranges a and b, with time running → from left to right. For example, the first diagram shows that “a precedes b” means that a ends before b begins, with a gap separating them; the second shows that “a meets b” means that a ends when b begins.”

relation        abbreviation  simplified
precedes        p             gap
meets           m             meet
overlaps        o             overlap
finished by     fi            overlap
contains        di            overlap
starts          s             overlap
equals          e             overlap
started by      si            overlap
during          d             overlap
finishes        f             overlap
overlapped by   oi            overlap
met by          mi            meet
preceded by     pi            gap

I love a thorough analysis like this! For our purposes, I added the last column, which simplifies the 13 relations to three situations: “gap”, “meet” and “overlap”.

Designing tables with date ranges: not so easy

Let’s create a very simple table with one row that covers March 11th and another that covers March 12th:

create table Date_Ranges(
  start_time timestamp,
  end_time timestamp
);
insert into Date_Ranges
select TIMESTAMP '2014-03-11 00:00:00',
TIMESTAMP '2014-03-12 00:00:00'
from dual
union all
select TIMESTAMP '2014-03-12 00:00:00',
TIMESTAMP '2014-03-13 00:00:00'
from dual;
commit;
START_TIME END_TIME
2014-03-11 00:00:00.000000 2014-03-12 00:00:00.000000
2014-03-12 00:00:00.000000 2014-03-13 00:00:00.000000

These rows should “meet” because the START_TIME of one = the END_TIME of another. Now let’s see what happens at midnight:

select * from Date_Ranges
where TIMESTAMP '2014-03-12 00:00:00' between start_time and end_time;
START_TIME END_TIME
2014-03-11 00:00:00.000000 2014-03-12 00:00:00.000000
2014-03-12 00:00:00.000000 2014-03-13 00:00:00.000000

We got both rows back. It appears that these two rows “overlap” at midnight! Maybe we should make END_TIME end just before midnight:

Update Date_Ranges
set end_time = end_time - numtodsinterval('1', 'second');
--
select start_time, end_time,
lead(start_time) over(order by start_time) - end_time gap
from Date_Ranges
START_TIME END_TIME GAP
2014-03-11 00:00:00.000000 2014-03-11 23:59:59.000000 +00 00:00:01.
2014-03-12 00:00:00.000000 2014-03-12 23:59:59.000000

Now we have a “gap” between the rows!

It looks like the only way to make rows “meet” is to define START_TIME as inclusive and END_TIME as exclusive. We go back to our original table data, but we change our SQL:

rollback;
select * from Date_Ranges
where start_time <= TIMESTAMP '2014-03-12 00:00:00'
                and TIMESTAMP '2014-03-12 00:00:00' < end_time;
START_TIME END_TIME
2014-03-12 00:00:00.000000 2014-03-13 00:00:00.000000
  • Rule 1: to make date ranges “meet”, make the START_TIME of the later row equal to the END_TIME of the earlier row.
  • Rule 2: make END_TIMEs exclusive by using
    “>= START_TIME” and “< END_TIME” instead of BETWEEN.

What do NULLs mean?

You think this is a trick question, right? Because NULLs aren’t supposed to “mean” anything. This is true, but very often an END_TIME with a null value means “until the end of time”. Less often, a null START_TIME means “starting at the beginning of time”. If you decide to use NULL in this way, remember that Oracle does not create index entries with entirely NULL values.

  • Rule 3: if necessary, allow NULL values for START_TIME (for the indefinite past) and / or END_TIME (for the indefinite future).
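
In SQL, that means the comparisons must tolerate NULLs explicitly. Here is a sketch, using a bind variable :t for the moment of interest (compare the predicate Oracle itself generates in the next section):

select * from Date_Ranges
where (start_time is null or start_time <= :t)  -- null = indefinite past
  and (end_time is null or end_time > :t);      -- null = indefinite future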

Oracle 12c agrees

Oracle 12c introduced support for “Temporal Validity”. As it turns out, the SQL that Oracle generates behind the scenes corresponds exactly to my 3 “rules”:

alter table Date_Ranges add period for valid_time(start_time, end_time);

select * from Date_Ranges
AS OF PERIOD FOR valid_time TIMESTAMP '2014-03-12 00:00:00';

select * from table(dbms_xplan.display_cursor);

---------------------------------------------------------------------------------
| Id  | Operation         | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |       |       |     3 (100)|          |
|*  1 |  TABLE ACCESS FULL| DATE_RANGES |     1 |    26 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter((("T"."START_TIME" IS NULL OR "T"."START_TIME"<=TIMESTAMP'
              2014-03-12 00:00:00.000000000') AND ("T"."END_TIME" IS NULL OR
              "T"."END_TIME">TIMESTAMP' 2014-03-12 00:00:00.000000000')))

12c MATCH_RECOGNIZE and the “start of group” method

Please don’t think this post is “just” about row pattern matching and MATCH_RECOGNIZE. I am going to present the “start of group” method as demonstrated by Timur Akhmadeev and Solomon Yakobson. This is a powerful use of analytics to group data based on comparisons between adjacent rows. It’s a technique worth knowing while waiting for 12c, when the MATCH_RECOGNIZE clause will make this much easier :)

Grouping events that are no more than 3 seconds apart

The first example comes from a Tom Kyte article in Oracle Magazine (scroll down to “Analytics to the rescue”). The input data looks like this:

create table t(time, amount) as select
to_date('11/22/2003 12:22:01', 'mm/dd/yyyy hh24:mi:ss'), 100 from dual union all select
to_date('11/22/2003 12:22:03', 'mm/dd/yyyy hh24:mi:ss'), 200 from dual union all select
to_date('11/22/2003 12:22:04', 'mm/dd/yyyy hh24:mi:ss'), 300 from dual union all select
to_date('11/22/2003 12:22:06', 'mm/dd/yyyy hh24:mi:ss'), 200 from dual union all select
to_date('11/22/2003 12:22:45', 'mm/dd/yyyy hh24:mi:ss'), 100 from dual union all select
to_date('11/22/2003 12:22:46', 'mm/dd/yyyy hh24:mi:ss'), 200 from dual union all select
to_date('11/22/2003 12:23:12', 'mm/dd/yyyy hh24:mi:ss'), 100 from dual union all select
to_date('11/22/2003 12:23:22', 'mm/dd/yyyy hh24:mi:ss'), 200 from dual;

alter session set nls_date_format = 'hh24:mi:ss';
TIME AMOUNT
12:22:01 100
12:22:03 200
12:22:04 300
12:22:06 200
12:22:45 100
12:22:46 200
12:23:12 100
12:23:22 200

The requirement is to group together rows whose time value is within three seconds of the previous or following row; for each group, output the first time, the last time and the sum of the amounts. The expected output is:

FIRST_TIME LAST_TIME SUM_AMOUNT
12:22:01 12:22:06 800
12:22:45 12:22:46 300
12:23:12 12:23:12 100
12:23:22 12:23:22 200

If the grouped rows were supposed to be exactly 3 seconds apart, I could treat them as “consecutive” and use the excellent Tabibitosan method. Since that is not the case, I have to use the “start of group” method, which has more steps. First it determines whether each row starts a new group or not.

select time, amount,
case
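  -- 3/24/60/60 = 3 seconds expressed as a fraction of a day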
  when time - lag(time,1,time-1) over (order by time) > 3/24/60/60
  then 1
  else 0
end grp_start
from t
TIME AMOUNT GRP_START
12:22:01 100 1
12:22:03 200 0
12:22:04 300 0
12:22:06 200 0
12:22:45 100 1
12:22:46 200 0
12:23:12 100 1
12:23:22 200 1

Using the LAG analytic function, I test the time difference between the current row and the previous row (the third argument, time-1, is the value to use on the first row, when LAG would otherwise return null). If the difference is more than 3 seconds, the row must start a new group, so I assign a value of 1; otherwise I assign 0.

Now here’s the fun part: I use the analytic SUM function on the above result to get a running total. Look what happens:

with a as (
  select time, amount,
  case
    when time - lag(time,1,time-1) over (order by time) > 3/24/60/60
    then 1
    else 0
  end grp_start
  from t
)
select time, amount, grp_start,
sum(grp_start) over (order by time) grp
from a
TIME AMOUNT GRP_START GRP
12:22:01 100 1 1
12:22:03 200 0 1
12:22:04 300 0 1
12:22:06 200 0 1
12:22:45 100 1 2
12:22:46 200 0 2
12:23:12 100 1 3
12:23:22 200 1 4

The running total has given us the group value we wanted. All we have to do now is group by it.

with a as (
  select time, amount,
  case
    when time - lag(time,1,time-1) over (order by time) > 3/24/60/60
    then 1
    else 0
  end grp_start
  from t
), b as (
  select time, amount, grp_start,
  sum(grp_start) over (order by time) grp
  from a
)
select min(time) first_time, max(time) last_time, sum(amount) sum_amount
from b
group by grp
order by 1;
FIRST_TIME LAST_TIME SUM_AMOUNT
12:22:01 12:22:06 800
12:22:45 12:22:46 300
12:23:12 12:23:12 100
12:23:22 12:23:22 200

This method is a bit less efficient than Tabibitosan, because it needs two analytic functions to assign groups to the rows, but it can be used much more often. As long as we can identify the start of each new group using analytics, this method will work.

The “row pattern matching” solution

I have already explained the basic syntax in my previous posts. Here I am taking advantage of the defaults to write the code as concisely as possible.

SELECT * FROM t
MATCH_RECOGNIZE(
  ORDER BY time
  MEASURES FIRST(time) first_time, LAST(time) last_time, SUM(amount) sum_amount
  PATTERN (a b*)
  DEFINE b AS time <= PREV(time) + 3/24/60/60
);

What’s the difference between this solution and the one I used in an earlier post to group consecutive values? Just one: in the DEFINE, I used “=” before and now I use “<=”.

Tabibitosan and “start of group” are two quite different techniques. Before 12c, you have to know both and know when to use each one. With 12c MATCH_RECOGNIZE, you just say what you want and then you get it.

Warning

In the post I linked to, Timur used the “start of group” technique on dba_extents and dba_free_space. Of course I had to redo his code using MATCH_RECOGNIZE. As I write this, my code is still running…wait, it’s not running anymore but the VM with my database just disappeared. I tried again and the VM disappeared again. Let’s be careful out there!

UPDATE: after setting the PROCESSES parameter to something more reasonable, my database and VM are stable but my code runs forever. The trace file shows a wait on ‘resmgr: cpu quantum’.