Beyond Analytics: MODEL or MATCH_RECOGNIZE

Analytic functions are powerful and efficient, but sometimes they just aren’t enough. When you try analytics and they alone don’t solve the problem, it’s time to think about the MODEL clause – or upgrade to 12c and use MATCH_RECOGNIZE. All three can use partitions and ordering for simpler, more efficient processing. To illustrate, here’s a question from AskTom that Brendan Furey and I answered three different ways > “Group records that are within 6 hours of the first record in the group”.

The requirement

The input contains several records per ID, each record having a datetime called INS_DATE.

drop table GRTAB purge;
create table GRTAB (id number, INS_DATE date);
--
insert into GRTAB values(1,TO_DATE('2011-05-25 23:13:32','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 02:14:19','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 04:15:30','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 05:14:31','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 07:15:19','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 10:15:50','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 13:44:46','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 15:14:54','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 16:15:01','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 17:14:38','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 19:15:36','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(2,TO_DATE('2011-05-30 11:30:17','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(2,TO_DATE('2011-05-30 14:30:22','YYYY-MM-DD HH24:MI:SS'));

For each ID, the first group starts with the earliest record, whose INS_DATE becomes the BASE_DATE for that group. This group contains all the records that are within 6 hours of BASE_DATE. The next group starts with the earliest record after the previous group, whose INS_DATE becomes the BASE_DATE of that group.

ID INS_DATE BASE_DATE
1 2011-05-25 23:13:32 2011-05-25 23:13:32
1 2011-05-26 02:14:19 2011-05-25 23:13:32
1 2011-05-26 04:15:30 2011-05-25 23:13:32
1 2011-05-26 05:14:31 2011-05-26 05:14:31
1 2011-05-26 07:15:19 2011-05-26 05:14:31
1 2011-05-26 10:15:50 2011-05-26 05:14:31
1 2011-05-26 13:44:46 2011-05-26 13:44:46
1 2011-05-26 15:14:54 2011-05-26 13:44:46
1 2011-05-26 16:15:01 2011-05-26 13:44:46
1 2011-05-26 17:14:38 2011-05-26 13:44:46
1 2011-05-26 19:15:36 2011-05-26 13:44:46
2 2011-05-30 11:30:17 2011-05-30 11:30:17
2 2011-05-30 14:30:22 2011-05-30 11:30:17

 

Why use analytics? PARTITION and ORDER BY!

If we know and use analytic functions, we naturally think of using them for this kind of problem. Why? Because the first thing we want to do with the input is PARTITION it by ID, and the next thing we want to do is ORDER each partition by INS_DATE. Partitioning makes processing simpler and more efficient: we only have to think about one ID at a time, and Oracle can independently process the records for each ID. Once we have partitioned, we obviously have to order by INS_DATE.

Why not use analytics?

That is the question. Partitioning and ordering may help solve a problem, but that doesn’t mean analytics are the complete solution. How can we tell when analytics will work and when they won’t?

Analytic functions have input and output: the input is the result of the JOINs, GROUP BYs, etc., and it already contains all the rows of the final result set. An analytic function can use input data from rows before and after the current row, but it cannot access the output from analytics in other rows.

One way around this limitation is to nest analytic functions with subqueries or inline views. The “start of group” method does exactly this: it uses analytics to assign 1 to the first row of a group (and 0 to the others), then it uses the 1s and 0s as input to a SUM() function to generate the group identifiers. Here analytics works because we assign 1 or 0 based only on the input to the function: we don’t need to know whether any previous row got a 1 or a 0.

We cannot solve our current problem with nested analytics because we don’t know how much nesting to do. We would have to nest again every time the BASE_DATE changes!

No, this is a job for what Tom Kyte calls “procedural” processing, even if it can be done in SQL.

Who else does partitioning? The MODEL clause

The MODEL clause, introduced in Oracle 10G, is complex and therefore hard to understand. Still, we can use bits of its functionality to do processing that works like analytics but lets us be “procedural”.

SELECT id, ins_date, base_date
FROM grtab
MODEL
  PARTITION BY (id)
  DIMENSION BY (Row_Number() OVER (PARTITION BY id ORDER BY ins_date) rn)
  MEASURES (ins_date, ins_date base_date)
  RULES (
    base_date[rn > 1] =
      CASE
        WHEN ins_date[cv()] - base_date[cv()-1] > 0.25 THEN ins_date[cv()] 
        ELSE base_date[cv()-1]
      END
  )
  • Without a detailed explanation, we can see that the MODEL clause allows partitioning.
  • Then, instead of an ORDER BY, we use a DIMENSION that orders the data and assigns a sequential subscript to each row.
  • The MEASURES clause adds a BASE_DATE column that is initially populated with the INS_DATE value.
  • After that, our RULE says to go through every row except the first, and to update the BASE_DATE column. The rule says in effect: if my ins_date is more than 6 hours later than the previous row’s base_date, then assign my ins_date to my base_date; otherwise my base_date is the same as the previous row’s base_date.
    By “my” I mean the current row. The MODEL clause refers to the current row using the subscript cv(). Since the subscripts are sequential integers, we can refer to the previous row using cv()-1.

If you look closely, you’ll see that the rule doesn’t make sense unless it is executed in order, from the second row to the last. This is not a problem, since “the order defaults to the order of the columns as specified in the DIMENSION BY clause.”

The key advantage to the MODEL clause over analytics is that it orders not only the data, but the processing. Calculations are done in order, and they can use the results of prior calculations.

12c MATCH_RECOGNIZE partitions and orders by

Once we have Oracle 12c, we can use the MATCH_RECOGNIZE clause to solve problems that require nested analytics, or that analytics can’t solve at all.

SELECT * FROM grtab
MATCH_RECOGNIZE (
  PARTITION BY ID
  ORDER BY ins_date
  MEASURES FIRST(ins_date) base_date
  ALL ROWS PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (A+)
  DEFINE A AS ins_date < FIRST(ins_date) + 0.25
)

Here we have our good old PARTITION and ORDER clauses, but unlike analytics they apply to the entire row and not just to one column. The PATTERN and DEFINE clauses unite in one “match” the first row and all following rows that are within 6 hours of the first row.

Then the “magic” happens: when the first “match” is done, the next match starts with the row right after the current match. This is what “after match skip past last row” means. I could have omitted the line because it is the default behavior, but it’s important enough to say explicitly. This instruction lets us “start over” after every match, based on the result of the previous match.

MATCH_RECOGNIZE is a kind of compromise between analytics and procedural processing: we can order the data like analytics, but we can also order the processing and (to some extent) base our logic on prior processing. However, the order of processing is the same as the order of data: we can’t go backwards, skip around or do loops. This type or processing is neither purely “set-based” nor fully “procedural”; I call it “stream-based” processing.

Performance

The execution plans for MODEL and MATCH_RECOGNIZE are pretty cryptic:

-------------------------------------
| Id  | Operation           | Name  |
-------------------------------------
|   0 | SELECT STATEMENT    |       |
|   1 |  SQL MODEL ORDERED  |       |
|   2 |   WINDOW SORT       |       |
|   3 |    TABLE ACCESS FULL| GRTAB |
-------------------------------------
-----------------------------------------------------------------
| Id  | Operation                                       | Name  |
-----------------------------------------------------------------
|   0 | SELECT STATEMENT                                |       |
|   1 |  VIEW                                           |       |
|   2 |   MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO|       |
|   3 |    TABLE ACCESS FULL                            | GRTAB |
-----------------------------------------------------------------

We know that the table was only accessed once, so the I/O was efficient, but we don’t have many details on memory and CPU usage. Using Tom Kyte’s RUNSTATS package on a table with 212992 rows, I found that MATCH_RECOGNIZE ran 20% faster, used 20% less CPU and 40% less memory than MODEL.

Conclusion

Analytic functions let us work with data in other rows than the current row, but we cannot access the results of analytic functions in other rows. We can use the MODEL clause when analytics are not enough: the key is to use ROW_NUMBER() in the DIMENSION clause to assign sequential subscripts to the data and to order the processing. The MATCH_RECOGNIZE clause orders both data and processing and lets us use the results of prior processing, especially to start a new “match” based on where the previous match ended.

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s