Please don’t think this post is “just” about row pattern matching and MATCH_RECOGNIZE
. I am going to present the “start of group” method as demonstrated by Timur Akhmadeev and Solomon Yakobson. This is a powerful use of analytics to group data based on comparisons between adjacent rows. It’s a technique worth knowing while waiting for 12c, when the MATCH_RECOGNIZE clause will make this much easier :)
Grouping events that are no more than 3 seconds apart
The first example comes from a Tom Kyte article in Oracle Magazine (scroll down to “Analytics to the rescue”). The input data looks like this:
create table t(time, amount) as select to_date('11/22/2003 12:22:01', 'mm/dd/yyyy hh24:mi;ss'), 100 from dual union all select to_date('11/22/2003 12:22:03', 'mm/dd/yyyy hh24:mi;ss'), 200 from dual union all select to_date('11/22/2003 12:22:04', 'mm/dd/yyyy hh24:mi;ss'), 300 from dual union all select to_date('11/22/2003 12:22:06', 'mm/dd/yyyy hh24:mi;ss'), 200 from dual union all select to_date('11/22/2003 12:22:45', 'mm/dd/yyyy hh24:mi;ss'), 100 from dual union all select to_date('11/22/2003 12:22:46', 'mm/dd/yyyy hh24:mi;ss'), 200 from dual union all select to_date('11/22/2003 12:23:12', 'mm/dd/yyyy hh24:mi;ss'), 100 from dual union all select to_date('11/22/2003 12:23:22', 'mm/dd/yyyy hh24:mi;ss'), 200 from dual; alter session set nls_date_format ='hh24:mi:ss';
TIME | AMOUNT |
---|---|
12:22:01 | 100 |
12:22:03 | 200 |
12:22:04 | 300 |
12:22:06 | 200 |
12:22:45 | 100 |
12:22:46 | 200 |
12:23:12 | 100 |
12:23:22 | 200 |
The requirement is to group together rows whose time value is within three seconds of the previous or following row; for each group, output the first time, the last time and the sum of the amounts. The expected output is:
FIRST_TIME | LAST_TIME | SUM_AMOUNT |
---|---|---|
12:22:01 | 12:22:06 | 800 |
12:22:45 | 12:22:46 | 300 |
12:23:12 | 12:23:12 | 100 |
12:23:22 | 12:23:22 | 200 |
If the grouped rows were supposed to be exactly 3 seconds apart, I could treat them as “consecutive” and use the excellent Tabibitosan method. Since that is not the case, I have to use the “start of group” method, which has more steps. First it determines whether each row starts a new group or not.
select time, amount, case when time - lag(time,1,time-1) over (order by time) > 3/24/60/60 then 1 else 0 end grp_start from t
TIME | AMOUNT | GRP_START |
---|---|---|
12:22:01 | 100 | 1 |
12:22:03 | 200 | 0 |
12:22:04 | 300 | 0 |
12:22:06 | 200 | 0 |
12:22:45 | 100 | 1 |
12:22:46 | 200 | 0 |
12:23:12 | 100 | 1 |
12:23:22 | 200 | 1 |
Using the LAG
analytic function, I test the time difference between the current row and the previous row (the third argument, time-1, is the value to use on the first row when LAG
returns null.) If the difference is more that 3 seconds, the row must start a new group so I assign a value of 1; otherwise I assign 0.
Now here’s the fun part: I use the analytic SUM function on the above result to get a running total. Look what happens:
with a as ( select time, amount, case when time - lag(time,1,time-1) over (order by time) > 3/24/60/60 then 1 else 0 end grp_start from t ) select time, amount, grp_start, sum(grp_start) over (order by time) grp from a
TIME | AMOUNT | GRP_START | GRP |
---|---|---|---|
12:22:01 | 100 | 1 | 1 |
12:22:03 | 200 | 0 | 1 |
12:22:04 | 300 | 0 | 1 |
12:22:06 | 200 | 0 | 1 |
12:22:45 | 100 | 1 | 2 |
12:22:46 | 200 | 0 | 2 |
12:23:12 | 100 | 1 | 3 |
12:23:22 | 200 | 1 | 4 |
The running total has given us the group value we wanted. All we have to do now is group by it.
with a as ( select time, amount, case when time - lag(time,1,time-1) over (order by time) > 3/24/60/60 then 1 else 0 end grp_start from t ), b as ( select time, amount, grp_start, sum(grp_start) over (order by time) grp from a ) select min(time) first_time, max(time) last_time, sum(amount) sum_amount from b group by grp order by 1;
FIRST_TIME | LAST_TIME | SUM_AMOUNT |
---|---|---|
12:22:01 | 12:22:06 | 800 |
12:22:45 | 12:22:46 | 300 |
12:23:12 | 12:23:12 | 100 |
12:23:22 | 12:23:22 | 200 |
This method is a bit less efficient than Tabibitosan, because it needs two analytic functions to assign groups to the rows, but it can be used much more often. As long as we can identify the start of each new group using analytics, this method will work.
The “row pattern matching” solution
I have already explained the basic syntax in my previous posts. Here I am taking advantage of the defaults to write the code as concisely as possible.
SELECT * FROM t MATCH_RECOGNIZE( ORDER BY time MEASURES FIRST(time) first_time, LAST(time) last_time, SUM(amount) sum_amount PATTERN (a b*) DEFINE b AS time <= PREV(time) + 3/24/60/60 );
What’s the difference between this solution and the one I used in an earlier post to group consecutive values? Just one: in the DEFINE
, I used “=” before and now I use “<=”.
Tabibitosan and “start of group” are two quite different techniques. Before 12c, you have to know both and know when to use each one. With 12c MATCH_RECOGNIZE
, you just say what you want and then you get it.
Warning
In the post I linked to, Timur used the “start of group” technique on dba_extents and dba_free_space. Of course I had to redo his code using MATCH_RECOGNIZE
. As I write this, my code is still running…wait, it’s not running anymore but the VM with my database just disappeared. I tried again and the VM disappeared again. Let’s be careful out there!
UPDATE: after setting the PROCESSES parameter to something more reasonable, my database and VM are stable but my code runs forever. The trace file shows a wait on ‘resmgr: cpu quantum’.
Update 2014-08-08: using Oracle version 12.2.0.2.0, my code runs in less than 9 seconds, whereas Timur’s version runs in 1 second. Getting better…