12c MATCH_RECOGNIZE and the “start of group” method

Please don’t think this post is “just” about row pattern matching and MATCH_RECOGNIZE. I am going to present the “start of group” method as demonstrated by Timur Akhmadeev and Solomon Yakobson. This is a powerful use of analytics to group data based on comparisons between adjacent rows. It’s a technique worth knowing while waiting for 12c, when the MATCH_RECOGNIZE clause will make this much easier :)

Grouping events that are no more than 3 seconds apart

The first example comes from a Tom Kyte article in Oracle Magazine (scroll down to “Analytics to the rescue”). The input data looks like this:

create table t(time, amount) as select
to_date('11/22/2003 12:22:01', 'mm/dd/yyyy hh24:mi;ss'), 100 from dual union all select
to_date('11/22/2003 12:22:03', 'mm/dd/yyyy hh24:mi;ss'), 200 from dual union all select
to_date('11/22/2003 12:22:04', 'mm/dd/yyyy hh24:mi;ss'), 300 from dual union all select
to_date('11/22/2003 12:22:06', 'mm/dd/yyyy hh24:mi;ss'), 200 from dual union all select
to_date('11/22/2003 12:22:45', 'mm/dd/yyyy hh24:mi;ss'), 100 from dual union all select
to_date('11/22/2003 12:22:46', 'mm/dd/yyyy hh24:mi;ss'), 200 from dual union all select
to_date('11/22/2003 12:23:12', 'mm/dd/yyyy hh24:mi;ss'), 100 from dual union all select
to_date('11/22/2003 12:23:22', 'mm/dd/yyyy hh24:mi;ss'), 200 from dual;

alter session set nls_date_format ='hh24:mi:ss';
TIME AMOUNT
12:22:01 100
12:22:03 200
12:22:04 300
12:22:06 200
12:22:45 100
12:22:46 200
12:23:12 100
12:23:22 200

The requirement is to group together rows whose time value is within three seconds of the previous or following row; for each group, output the first time, the last time and the sum of the amounts. The expected output is:

FIRST_TIME LAST_TIME SUM_AMOUNT
12:22:01 12:22:06 800
12:22:45 12:22:46 300
12:23:12 12:23:12 100
12:23:22 12:23:22 200

If the grouped rows were supposed to be exactly 3 seconds apart, I could treat them as “consecutive” and use the excellent Tabibitosan method. Since that is not the case, I have to use the “start of group” method, which has more steps. First it determines whether each row starts a new group or not.

select time, amount,
case
  when time - lag(time,1,time-1) over (order by time) > 3/24/60/60
  then 1
  else 0
end grp_start
from t
TIME AMOUNT GRP_START
12:22:01 100 1
12:22:03 200 0
12:22:04 300 0
12:22:06 200 0
12:22:45 100 1
12:22:46 200 0
12:23:12 100 1
12:23:22 200 1

Using the LAG analytic function, I test the time difference between the current row and the previous row (the third argument, time-1, is the value to use on the first row when LAG returns null.) If the difference is more that 3 seconds, the row must start a new group so I assign a value of 1; otherwise I assign 0.

Now here’s the fun part: I use the analytic SUM function on the above result to get a running total. Look what happens:

with a as (
  select time, amount,
  case
    when time - lag(time,1,time-1) over (order by time) > 3/24/60/60
    then 1
    else 0
  end grp_start
  from t
)
select time, amount, grp_start,
sum(grp_start) over (order by time) grp
from a
TIME AMOUNT GRP_START GRP
12:22:01 100 1 1
12:22:03 200 0 1
12:22:04 300 0 1
12:22:06 200 0 1
12:22:45 100 1 2
12:22:46 200 0 2
12:23:12 100 1 3
12:23:22 200 1 4

The running total has given us the group value we wanted. All we have to do now is group by it.

with a as (
  select time, amount,
  case
    when time - lag(time,1,time-1) over (order by time) > 3/24/60/60
    then 1
    else 0
  end grp_start
  from t
), b as (
  select time, amount, grp_start,
  sum(grp_start) over (order by time) grp
  from a
)
select min(time) first_time, max(time) last_time, sum(amount) sum_amount
from b
group by grp
order by 1;
FIRST_TIME LAST_TIME SUM_AMOUNT
12:22:01 12:22:06 800
12:22:45 12:22:46 300
12:23:12 12:23:12 100
12:23:22 12:23:22 200

This method is a bit less efficient than Tabibitosan, because it needs two analytic functions to assign groups to the rows, but it can be used much more often. As long as we can identify the start of each new group using analytics, this method will work.

The “row pattern matching” solution

I have already explained the basic syntax in my previous posts. Here I am taking advantage of the defaults to write the code as concisely as possible.

SELECT * FROM t
MATCH_RECOGNIZE(
  ORDER BY time
  MEASURES FIRST(time) first_time, LAST(time) last_time, SUM(amount) sum_amount
  PATTERN (a b*)
  DEFINE b AS time <= PREV(time) + 3/24/60/60
);

What’s the difference between this solution and the one I used in an earlier post to group consecutive values? Just one: in the DEFINE, I used “=” before and now I use “<=”.

Tabibitosan and “start of group” are two quite different techniques. Before 12c, you have to know both and know when to use each one. With 12c MATCH_RECOGNIZE, you just say what you want and then you get it.

Warning

In the post I linked to, Timur used the “start of group” technique on dba_extents and dba_free_space. Of course I had to redo his code using MATCH_RECOGNIZE. As I write this, my code is still running…wait, it’s not running anymore but the VM with my database just disappeared. I tried again and the VM disappeared again. Let’s be careful out there!

UPDATE: after setting the PROCESSES parameter to something more reasonable, my database and VM are stable but my code runs forever. The trace file shows a wait on ‘resmgr: cpu quantum’.

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s