OpenWorld Presentation on Row Pattern Matching

You can download my presentation from Slideshare.

Please download instead of viewing: the Slideshare viewer doesn’t show the animations I worked so hard on.

I have already blogged about some of the topics in my presentation. Sometime soon I will continue with in-depth looks at other issues, such as “catastrophic backtracking”.

For the many who missed it, I will be making the same presentation at UKOUG Tech 14 in December.

Oracle 12c Row Pattern Matching: Beat the Best Pre-12c Solutions at OpenWorld!

Catchy title, don’t you think? My session has been moved to Monday 4 P.M., in direct conflict with Tom Kyte – and Keith Laker, who asked me to present in the first place.

Avoid the lines: come see the MATCH_RECOGNIZE clause push great pre-12c solutions into retirement. As a bonus, be the first person on your block able to prevent “catastrophic backtracking”.

Click here to see my session in the OpenWorld catalog

Rob van Wijk and ad hoc Grouping

In March 2014, I wrote about two methods for ad hoc grouping: “Grouping Sequences” and “Start of Group”. I just found out that Rob van Wijk wrote on the same subjects two months earlier. We even used the same example!

I mention Rob’s article for two reasons:

  • If my explanation wasn’t clear enough, you can try Rob’s.
  • When I use “Start of Group”, Rob uses a similar method he calls “max-on-case-row-number”. These techniques are completely equivalent. Execution plans and performance are identical.

The “Start of Group” method can take many forms. Don’t think you need to know every one! Just choose the form you understand and like best, and use it all the time.
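For example, here is a minimal sketch of the form I use most, on a hypothetical table T(VAL) where a new group starts whenever VAL is not exactly one more than the previous VAL: LAG() flags the first row of each group with 1 (and the others with 0), then a running SUM() of the flags produces the group identifiers.

select val,
  sum(start_of_group) over (order by val) grp
from (
  select val,
    case when val - lag(val) over (order by val) = 1
         then 0 else 1 end start_of_group  -- 1 = this row starts a new group
  from t
)
order by val;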

Don’t forget, when you get 12c you can replace both methods by the MATCH_RECOGNIZE clause :)

SQL and date ranges: don’t make NULL mean something

We often use NULL values in date ranges, usually in the end date to signify the indefinite future. I have decided that this is a bad idea; here are my reasons.

[Update 2014-07-04] There have been some great replies to this post. I am going to address some of their points.

1) NULL should not mean anything

According to the SQL standard, NULL means “unknown”. That is why you can’t really compare a NULL value to anything; you can only test whether it IS NULL or not.

If you use NULL in date ranges, it doesn’t just mean something, it means two different things! In the start date it means “from the beginning of time”, and in the end date it means “until the end of time”. This is not only contrary to the “official” meaning of NULL, it is confusing.

[Update] Kevan Gelling points out that there may really be “unknown” values, in which case NULL should be allowed. I agree, as long as NULL is reserved for that use.

Jeffrey Kemp says: ‘In most cases an End Date of NULL means “we don’t know (yet) what the end date will be – or even if it will ever end”.’ Jeffrey, you should add “we don’t even know if it has already ended”! Your phrase taken alone implies that the end date is in the future, not in the past. This makes NULL meaningful.

2) We can use real date limits instead of NULL

One argument for using NULL is that it means the earliest possible start date, or the latest possible end date. We don’t need NULL for that! In Oracle, the earliest possible date is January 1, 4712 B.C., or DATE '-4712-01-01'. The latest possible date is DATE '9999-12-31' (you can add the time element 23:59:59 if you want to be a purist).

To enforce this, I suggest declaring the start and end date columns as NOT NULL with default values. In Oracle Database 12c, I would use the DEFAULT ON NULL clause: this clause puts the default value in the column even if you explicitly try to put a NULL there.

If you want the output to show NULL instead of these default values, you do have to use something like NULLIF() in your SELECT clause.
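Here is a minimal sketch of that setup; the table and column names are mine, just for illustration:

create table date_ranges (
  obj_id     number not null,
  -- DEFAULT ON NULL implies NOT NULL: even an explicit NULL becomes the default
  start_date date default on null date '-4712-01-01',
  end_date   date default on null date '9999-12-31'
);

insert into date_ranges (obj_id, start_date, end_date)
values (1, date '2014-01-01', null);  -- END_DATE is stored as 9999-12-31

select obj_id, start_date,
  nullif(end_date, date '9999-12-31') end_date  -- show NULL on output
from date_ranges;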

[Update] Again, Kevan Gelling argues that NULL may be necessary to indicate a value that is not known. In that case my suggestion is no good, but you still have to use something other than NULL to indicate “beginning of time” and “end of time”.

Several posters bring up the fact that artificial extreme dates will “skew” the data and make it hard for the optimizer to choose the best plan. They are right, I should mention this. However, the data is skewed whether I use extreme dates or NULL. Supposing I do use NULL to mean “the end of time”, many queries will have to use END_DATE IS NULL in the WHERE clause, so the skew is there no matter what.

3) Oracle doesn’t index NULLs

When you query data with date ranges, you often have to check “greater than” one value and “less than” another value. This may require separate indexes on start date and end date. If you have NULLs in your date ranges, those indexes will not be used, since Oracle doesn’t create index entries when all the indexed columns are NULL.

If you use real values and NOT NULL, your indexes will always work.

[Update] Some readers apparently didn’t see the key phrase: Oracle creates no index entry only when all the indexed columns are NULL. The header was probably misleading: Oracle will indeed index a row if any of the indexed columns is not NULL.

Some objected that indexes on dates alone are rarely used: date ranges are almost always applied to some object, so the index will include the object and one or both dates. In that case, the rows with NULL dates will be indexed. I agree. In that situation, NULLs don’t cause a problem, and “skew” won’t either, as long as the object comes before the date in the index column list.
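A quick sketch of that situation, with made-up names:

create table deals (
  deal_id  number not null,
  end_date date            -- NULL allowed
);
-- every row is indexed: DEAL_ID is never NULL, and Oracle skips a row
-- only when ALL the indexed columns are NULL
create index deals_id_end_ix on deals(deal_id, end_date);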

4) Queries on NULLable date ranges are hard

I have blogged about various date range problems: finding gaps, finding overlaps and merging contiguous ranges. Almost always, my solutions worked fine without NULLs and broke when I introduced NULLs. Handling NULLs required either more complex queries or substituting real values for NULL. Why go to all that work when we can just use the real values to begin with?

[Update] Even if we need NULL values sometimes to mean “unknown value”, we still don’t use IS NULL in our queries, so indexes should work when needed.

5) Unique constraints don’t work with NULL

Most queries on date ranges, including mine, assume that start dates are unique. In some cases, it may also be useful to have unique end dates. The only way to make sure they are unique is to add a unique constraint. Unfortunately, uniqueness is only enforced for NOT NULL values.

Most of the time, your constraint will be on two columns: some object and a date. In this case the unique constraint will work as long as the object column is NOT NULL. However, anytime you need unique dates throughout a table you must define them as NOT NULL and use default values.

[Update] As mentioned under paragraph 3), this point is only valid for unique constraints on the date alone. This is probably a rare case. Also, as Kevan reminded me, you could always use a function-based index on (DATE_COL, 'X') to make sure every row was indexed.
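If I understand the suggestion correctly, on a hypothetical table T it would look something like this sketch:

-- the constant second column is never NULL, so no row has all indexed
-- columns NULL, and even rows where DATE_COL is NULL get index entries
create index t_date_ix on t(date_col, 'X');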

[Update] Conclusion: NULL should mean “unknown” only

As readers have pointed out, my arguments 3) and 5) are weak, since they only apply to indexes or constraints on a date column alone, and those are rarely needed.

I’ll stand by arguments 1), 2) and 4), but thanks to Kevan I’ll add one thing: just because I don’t use NULL to mean “beginning or end of time”, that doesn’t mean I might not need it to mean “unknown value”. In that case, default values may not be the way to go.

Finally, please read the replies: good stuff, and thanks to all!

SQL to find World Cup matches with comebacks

Lucas Jellema is writing a great series on SQL applied to the World Cup of football / soccer. In the article about finding matches with comebacks, he challenged readers to “find ‘dramatic comebacks’ where a team was two goals behind at some stage and yet managed to win the game.” Here is my reply, since replying directly on his blog was too challenging for me!

Here is a copy of Lucas’ DDL and test data. I added three rows to his data: one with a draw, one with a scoreless draw, and one with a “dramatic comeback”.

drop table match_results;
create table match_results
   (    "GROUP1" VARCHAR2(1 BYTE), 
    "HOME" NUMBER(1,0), 
    "AWAY" NUMBER(1,0), 
    "HOME_GOALS" NUMBER(2,0), 
    "AWAY_GOALS" NUMBER(2,0), 
    "SDM_ID" NUMBER(2,0), 
    "ID" NUMBER(2,0), 
    "SCORING_PROCESS" VARCHAR2(10 BYTE) -- '1'/'0' sequence of who scored each goal
   );
-- (the INSERT statements are not reproduced here: Lucas' test data plus my three added rows)

The SCORING_PROCESS column shows who scored in what order: for example ‘100’ means one team scored the first goal, then the other team scored 2 goals.

Who won?

Conceptually, the first thing I do is determine who won the game, by counting the number of ‘1’ and the number of ‘0’. When the ‘1’s are ahead, I say the winner is 1; when there are more ‘0’s, I say the winner is -1. For draws, the “winner” is 0; I then remove draws in my WHERE clause.

select id, scoring_process, 
  sign(length(replace(scoring_process||'01','0','')) 
     - length(replace(scoring_process||'01','1',''))) winner
from match_results
where sign(length(replace(scoring_process||'01','0','')) 
         - length(replace(scoring_process||'01','1',''))) 
  != 0;
ID SCORING_PROCESS WINNER
1 1000 -1
2 0 -1
17 0 -1
18 1111 1
3 011111 1
4 0010 -1
19 11 1
20 10011 1
23 00011111 1

How far behind did the winner get?

For each row, I go through the SCORING_PROCESS column, turning ‘1’ into the number 1 and ‘0’ into the number -1. This makes it easy to use the SUM() analytic function to calculate who led by how much after each goal. Here is an example using a hard-coded value.

select sum( 
  case substr('00011111', level, 1) when '1' then 1 else -1 end -- '1' scores +1, '0' scores -1
) over (order by level)
from dual 
connect by level <= length('00011111');

In this case, I know the winner was team 1, so I can see that they came back from a 3-goal deficit. Putting it all together, here is a solution that shows all the comebacks and how far behind the winning team got.

select id, scoring_process, 
max(-winner*column_value) max_behind 
from ( 
  select id, scoring_process, 
    sign(length(replace(scoring_process||'01','0','')) 
       - length(replace(scoring_process||'01','1',''))) winner
  from match_results
  where sign(length(replace(scoring_process||'01','0','')) 
           - length(replace(scoring_process||'01','1',''))) 
    != 0
) mr, 
table(cast(multiset( -- one running total per goal for each match
  select sum( 
    case substr(mr.scoring_process, level, 1) when '1' then 1 else -1 end 
  ) over (order by level) 
  from dual 
  connect by level <= length(mr.scoring_process)
) as sys.odcinumberlist)) l
group by id, scoring_process
having max(-winner*column_value) > 0;
ID SCORING_PROCESS MAX_BEHIND
23 00011111 3
3 011111 1
1 1000 1
20 10011 1

Bin Fitting problems with SQL

“Bin Fitting” or “Bin Packing” means putting the greatest quantity in the smallest number of “bins” or containers. There are some good solutions with the MODEL clause, but the most concise and efficient solutions use the new 12c MATCH_RECOGNIZE clause.

There are two categories of “bin fitting” problems:

  1. The number of bins is fixed and the capacity can vary.
    > We want to equalize the quantities we put into each bin.
  2. The bins have a fixed capacity and the number of bins can vary.
    > We want to minimize the number of bins.

I’ll show an example of each.

Fill a fixed number of bins as equally as possible

Wikipedia calls this the “Multiprocessor scheduling problem”. Brendan Furey, author of a brilliant MODEL solution, calls it “balanced number partitioning”. Let’s take a simple, reproducible example:

create table items as
SELECT LEVEL item_name, level item_value 
from dual
connect by level <= 10;

The total value of all the items is 55. We want to split the items into 3 “bins” or “buckets” or groups, so that the sums of the values are as equal as possible. Here the best we can do is put 19 in one bin and 18 in each of the other bins.

There are several algorithms that solve this problem more or less accurately. The one I’m going to show is fairly simple:

  • First, order the items by value in descending order.
  • Then, assign each item to whatever bin has the smallest sum so far (see the sketch below).
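Before looking at the MODEL code, here is a minimal PL/SQL sketch of this greedy algorithm (mine, just for illustration), assuming the ITEMS table above and 3 bins:

-- set serveroutput on to see the assignments
declare
  type bin_tab is table of number index by pls_integer;
  bins    bin_tab;
  min_bin pls_integer;
begin
  for i in 1 .. 3 loop
    bins(i) := 0;                -- all bins start empty
  end loop;
  for r in (select item_name, item_value from items order by item_value desc) loop
    min_bin := 1;                -- find the bin with the smallest sum so far
    for i in 2 .. 3 loop
      if bins(i) < bins(min_bin) then
        min_bin := i;
      end if;
    end loop;
    bins(min_bin) := bins(min_bin) + r.item_value;
    dbms_output.put_line(r.item_name || ' -> bin ' || min_bin);
  end loop;
end;
/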

Here is Brendan’s MODEL implementation:

SELECT item_name, bin, item_value,
Max (bin_value) OVER (PARTITION BY bin) bin_value
FROM (
  SELECT * FROM items
  MODEL
  DIMENSION BY (Row_Number() OVER (ORDER BY item_value DESC) rn)
  MEASURES (
    item_name,
    item_value,
    Row_Number() OVER (ORDER BY item_value DESC) bin,
    item_value bin_value,
    Row_Number() OVER (ORDER BY item_value DESC) rn_m,
    0 min_bin,
    Count(*) OVER () - :N_BINS - 1 n_iters
  )
  RULES ITERATE(100000) UNTIL (ITERATION_NUMBER >= n_iters[1]) (
    min_bin[1] =
      Min(rn_m) KEEP (DENSE_RANK FIRST ORDER BY bin_value)[rn <= :N_BINS],
    bin[ITERATION_NUMBER + :N_BINS + 1] = min_bin[1],
    bin_value[min_bin[1]] =
      bin_value[CV()] + Nvl (item_value[ITERATION_NUMBER + :N_BINS + 1], 0)
  )
)
ORDER BY item_value DESC;
ITEM_NAME BIN ITEM_VALUE BIN_VALUE
10 1 10 19
9 2 9 18
8 3 8 18
7 3 7 18
6 2 6 18
5 1 5 19
4 1 4 19
3 2 3 18
2 3 2 18
1 3 1 18

Please refer to Brendan’s post for a full explanation, but notice the number of “bins” is actually a bind variable!

Unfortunately, this is not a very scalable solution: with 10,000 records, this query runs in about 30 seconds. My implementation with the 12c MATCH_RECOGNIZE clause has to hard-code the number of buckets, but it does 10,000 records in 0.2 seconds!

SELECT * from items
MATCH_RECOGNIZE (
  ORDER BY item_value desc
  MEASURES to_number(substr(classifier(),4)) bin#,
    sum(bin1.item_value) bin1,
    sum(bin2.item_value) bin2,
    sum(bin3.item_value) bin3
  ALL ROWS PER MATCH
  PATTERN ((bin1|bin2|bin3)*)
  DEFINE bin1 AS count(bin1.*) = 1
    OR sum(bin1.item_value)-bin1.item_value
        <= least(sum(bin2.item_value), sum(bin3.item_value)),
    bin2 AS count(bin2.*) = 1
      OR sum(bin2.item_value)-bin2.item_value
        <= sum(bin3.item_value)
);
ITEM_NAME ITEM_VALUE BIN# BIN1 BIN2 BIN3
10 10 1 10
9 9 2 10 9
8 8 3 10 9 8
7 7 3 10 9 15
6 6 2 10 15 15
5 5 1 15 15 15
4 4 1 19 15 15
3 3 2 19 18 15
2 2 3 19 18 17
1 1 3 19 18 18

I’ve already mentioned the basic syntax of this clause in previous posts, but look at the PATTERN line. The vertical pipe means “OR” – I want a line that matches either “bin1”, “bin2” or “bin3”, and the first match wins.

The DEFINE clause takes some getting used to. The way it works is: you assume that the current row matches the first condition and you test that condition; if the condition is true then there really is a match and you move on to the next row. Otherwise you test the second condition, and so on until the last. In this case, I don’t have to define BIN3 at all: it’s always true if none of the previous conditions match.

The BIN1 condition starts with count(bin1.*) = 1: this can only be true if no previous row has been assigned to BIN1. The other part of the condition compares the sums of BIN1, BIN2 and BIN3. If BIN1 has the smallest amount, it gets the current row. Now here’s the tricky part: if I just say sum(bin1.item_value), then that sum will include the row I am testing. What I really want is the sum so far, not including the row I am testing, so I have to subtract that value.

The BIN2 condition is simpler, because I already know that BIN1 contains more than BIN2.

Bins have a fixed capacity: use as few bins as possible

My first post on the MATCH_RECOGNIZE clause showed how to calculate running totals that start over before the total exceeds a certain value. This is a simple version of the “fixed capacity” problem.

I’m going to complicate things a bit by letting each bin have an elastic capacity. The normal capacity is 50, but in some cases I can go over the limit of 50, as long as I never go over 55.

  • If the current bin already has 50 or more, it is full and cannot receive a new item.
  • If the current bin has less than 50, I can add a new item as long as the new total is 55 or less.

I admit that I was hoping to solve this problem by using more advanced syntax in the PATTERN clause, such as “reluctant quantifiers”. I tried 4 or 5 different solutions and they all had bugs. The only solution that seems to work uses a simple PATTERN and a slightly more complex DEFINE. Anyway, here’s the code to create a test table with different boundary conditions.

drop table t cascade constraints purge;
create table t(cat_id number, item_id number, num_units number, incr number);

insert into t
select rownum cat_id, rownum item_id, num_units, incr from (
  select level incr from dual
  connect by level <= 6
) a
, lateral(
  select 51 - level num_units from dual
  connect by level <= a.incr+1
);

insert into t
select cat_id, rownum+100 item_id, incr num_units, incr from t
, lateral(
  select null n from dual
  connect by level <= (56 - t.num_units)/t.incr+1
);

insert into t values(100, 200, 41, 1);

select cat_id, item_id, num_units,
incr "Increment"
from t
where num_units > 40
order by cat_id, item_id;
CAT_ID ITEM_ID NUM_UNITS Increment
1 1 50 1
2 2 49 1
3 3 50 2
4 4 49 2
5 5 48 2
6 6 50 3
7 7 49 3
8 8 48 3
9 9 47 3
10 10 50 4
11 11 49 4
12 12 48 4
13 13 47 4
14 14 46 4
15 15 50 5
16 16 49 5
17 17 48 5
18 18 47 5
19 19 46 5
20 20 45 5
21 21 50 6
22 22 49 6
23 23 48 6
24 24 47 6
25 25 46 6
26 26 45 6
27 27 44 6
100 200 41 1

Each category is a different test case. The first item in each category has a value between 44 and 50. The following items all have the same values, indicated in the “Increment” column. I add a last case at the end where I don’t reach the limit of 50.

Notice the use of the LATERAL clause, which is now documented in 12c. It’s a lot simpler than the old TABLE(CAST(MULTISET syntax. Now here is the solution; I only show the first bin in each category, plus the first item in the second bin. I’ve got nothing more to say, so don’t scroll to the bottom if you don’t feel like it :)

select cat_id, bin#, sum_units, num_units, item_id
from t match_recognize(
  partition by cat_id
  order by item_id
  measures final sum(num_units) sum_units,
    match_number() bin#, count(*) cnt
  all rows per match
  pattern (A+)
  define A as sum(num_units) - num_units < 50
    and sum(num_units) <= 55
)
where bin# = 1 or cnt = 1;
CAT_ID BIN# SUM_UNITS NUM_UNITS ITEM_ID
1 1 50 50 1
1 2 7 1 101
2 1 50 49 2
2 1 50 1 108
2 2 7 1 109
3 1 50 50 3
3 2 8 2 116
4 1 51 49 4
4 1 51 2 120
4 2 6 2 121
5 1 50 48 5
5 1 50 2 124
5 2 8 2 125
6 1 50 50 6
6 2 9 3 129
7 1 52 49 7
7 1 52 3 132
7 2 6 3 133
8 1 51 48 8
8 1 51 3 135
8 2 6 3 136
9 1 50 47 9
9 1 50 3 138
9 2 9 3 139
10 1 50 50 10
10 2 8 4 142
11 1 53 49 11
11 1 53 4 144
11 2 4 4 145
12 1 52 48 12
12 1 52 4 146
12 2 8 4 147
13 1 51 47 13
13 1 51 4 149
13 2 8 4 150
14 1 50 46 14
14 1 50 4 152
14 2 8 4 153
15 1 50 50 15
15 2 10 5 155
16 1 54 49 16
16 1 54 5 157
16 2 5 5 158
17 1 53 48 17
17 1 53 5 159
17 2 5 5 160
18 1 52 47 18
18 1 52 5 161
18 2 5 5 162
19 1 51 46 19
19 1 51 5 163
19 2 10 5 164
20 1 50 45 20
20 1 50 5 166
20 2 10 5 167
21 1 50 50 21
21 2 12 6 169
22 1 55 49 22
22 1 55 6 171
22 2 6 6 172
23 1 54 48 23
23 1 54 6 173
23 2 6 6 174
24 1 53 47 24
24 1 53 6 175
24 2 6 6 176
25 1 52 46 25
25 1 52 6 177
25 2 6 6 178
26 1 51 45 26
26 1 51 6 179
26 2 6 6 180
27 1 50 44 27
27 1 50 6 181
27 2 12 6 182
100 1 41 41 200

Beyond Analytics: MODEL or MATCH_RECOGNIZE

Analytic functions are powerful and efficient, but sometimes they just aren’t enough. When you try analytics and they alone don’t solve the problem, it’s time to think about the MODEL clause – or upgrade to 12c and use MATCH_RECOGNIZE. All three can use partitions and ordering for simpler, more efficient processing. To illustrate, here’s a question from AskTom that Brendan Furey and I answered three different ways: “Group records that are within 6 hours of the first record in the group”.

The requirement

The input contains several records per ID, each record having a datetime called INS_DATE.

drop table GRTAB purge;
create table GRTAB (id number, INS_DATE date);
insert into GRTAB values(1,TO_DATE('2011-05-25 23:13:32','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 02:14:19','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 04:15:30','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 05:14:31','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 07:15:19','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 10:15:50','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 13:44:46','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 15:14:54','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 16:15:01','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 17:14:38','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 19:15:36','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(2,TO_DATE('2011-05-30 11:30:17','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(2,TO_DATE('2011-05-30 14:30:22','YYYY-MM-DD HH24:MI:SS'));

For each ID, the first group starts with the earliest record, whose INS_DATE becomes the BASE_DATE for that group. This group contains all the records that are within 6 hours of BASE_DATE. The next group starts with the earliest record after the previous group, whose INS_DATE becomes the BASE_DATE of that group.

ID INS_DATE BASE_DATE
1 2011-05-25 23:13:32 2011-05-25 23:13:32
1 2011-05-26 02:14:19 2011-05-25 23:13:32
1 2011-05-26 04:15:30 2011-05-25 23:13:32
1 2011-05-26 05:14:31 2011-05-26 05:14:31
1 2011-05-26 07:15:19 2011-05-26 05:14:31
1 2011-05-26 10:15:50 2011-05-26 05:14:31
1 2011-05-26 13:44:46 2011-05-26 13:44:46
1 2011-05-26 15:14:54 2011-05-26 13:44:46
1 2011-05-26 16:15:01 2011-05-26 13:44:46
1 2011-05-26 17:14:38 2011-05-26 13:44:46
1 2011-05-26 19:15:36 2011-05-26 13:44:46
2 2011-05-30 11:30:17 2011-05-30 11:30:17
2 2011-05-30 14:30:22 2011-05-30 11:30:17


Why use analytics? PARTITION and ORDER BY!

If we know and use analytic functions, we naturally think of using them for this kind of problem. Why? Because the first thing we want to do with the input is PARTITION it by ID, and the next thing we want to do is ORDER each partition by INS_DATE. Partitioning makes processing simpler and more efficient: we only have to think about one ID at a time, and Oracle can independently process the records for each ID. Once we have partitioned, we obviously have to order by INS_DATE.

Why not use analytics?

That is the question. Partitioning and ordering may help solve a problem, but that doesn’t mean analytics are the complete solution. How can we tell when analytics will work and when they won’t?

Analytic functions have input and output: the input is the result of the JOINs, GROUP BYs, etc., and it already contains all the rows of the final result set. An analytic function can use input data from rows before and after the current row, but it cannot access the output from analytics in other rows.

One way around this limitation is to nest analytic functions with subqueries or inline views. The “start of group” method does exactly this: it uses analytics to assign 1 to the first row of a group (and 0 to the others), then it uses the 1s and 0s as input to a SUM() function to generate the group identifiers. Here analytics works because we assign 1 or 0 based only on the input to the function: we don’t need to know whether any previous row got a 1 or a 0.

We cannot solve our current problem with nested analytics because we don’t know how much nesting to do. We would have to nest again every time the BASE_DATE changes!

No, this is a job for what Tom Kyte calls “procedural” processing, even if it can be done in SQL.

Who else does partitioning? The MODEL clause

The MODEL clause, introduced in Oracle 10g, is complex and therefore hard to understand. Still, we can use bits of its functionality to do processing that works like analytics but lets us be “procedural”.

SELECT id, ins_date, base_date
FROM grtab
MODEL
  PARTITION BY (id)
  DIMENSION BY (Row_Number() OVER (PARTITION BY id ORDER BY ins_date) rn)
  MEASURES (ins_date, ins_date base_date)
  RULES (
    base_date[rn > 1] =
      CASE WHEN ins_date[cv()] - base_date[cv()-1] > 0.25 THEN ins_date[cv()] 
        ELSE base_date[cv()-1]
      END
  );
  • Without a detailed explanation, we can see that the MODEL clause allows partitioning.
  • Then, instead of an ORDER BY, we use a DIMENSION that orders the data and assigns a sequential subscript to each row.
  • The MEASURES clause adds a BASE_DATE column that is initially populated with the INS_DATE value.
  • After that, our RULE says to go through every row except the first, and to update the BASE_DATE column. The rule says in effect: if my ins_date is more than 6 hours later than the previous row’s base_date, then assign my ins_date to my base_date; otherwise my base_date is the same as the previous row’s base_date.
    By “my” I mean the current row. The MODEL clause refers to the current row using the subscript cv(). Since the subscripts are sequential integers, we can refer to the previous row using cv()-1.

If you look closely, you’ll see that the rule doesn’t make sense unless it is executed in order, from the second row to the last. This is not a problem, since “the order defaults to the order of the columns as specified in the DIMENSION BY clause.”

The key advantage to the MODEL clause over analytics is that it orders not only the data, but the processing. Calculations are done in order, and they can use the results of prior calculations.

12c MATCH_RECOGNIZE partitions and orders by

Once we have Oracle 12c, we can use the MATCH_RECOGNIZE clause to solve problems that require nested analytics, or that analytics can’t solve at all.

SELECT id, ins_date, base_date
FROM grtab MATCH_RECOGNIZE(
  PARTITION BY id
  ORDER BY ins_date
  MEASURES FIRST(ins_date) base_date
  ALL ROWS PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (A+)
  DEFINE A AS ins_date < FIRST(ins_date) + 0.25
);

Here we have our good old PARTITION and ORDER clauses, but unlike analytics they apply to the entire row and not just to one column. The PATTERN and DEFINE clauses unite in one “match” the first row and all following rows that are within 6 hours of the first row.

Then the “magic” happens: when the first “match” is done, the next match starts with the row right after the current match. This is what “after match skip past last row” means. I could have omitted the line because it is the default behavior, but it’s important enough to say explicitly. This instruction lets us “start over” after every match, based on the result of the previous match.

MATCH_RECOGNIZE is a kind of compromise between analytics and procedural processing: we can order the data like analytics, but we can also order the processing and (to some extent) base our logic on prior processing. However, the order of processing is the same as the order of data: we can’t go backwards, skip around or do loops. This type of processing is neither purely “set-based” nor fully “procedural”; I call it “stream-based” processing.


The execution plans for MODEL and MATCH_RECOGNIZE are pretty cryptic:

| Id  | Operation           | Name  |
|   0 | SELECT STATEMENT    |       |
|   1 |  SQL MODEL ORDERED  |       |
|   2 |   WINDOW SORT       |       |
|   3 |    TABLE ACCESS FULL| GRTAB |

| Id  | Operation                                           | Name  |
|   0 | SELECT STATEMENT                                    |       |
|   1 |  VIEW                                               |       |
|   2 |   MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTOMATON |     |
|   3 |    TABLE ACCESS FULL                                | GRTAB |

We know that the table was only accessed once, so the I/O was efficient, but we don’t have many details on memory and CPU usage. Using Tom Kyte’s RUNSTATS package on a table with 212992 rows, I found that MATCH_RECOGNIZE ran 20% faster, used 20% less CPU and 40% less memory than MODEL.
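For those who want to reproduce the test, the harness looks something like this sketch (assuming Tom's RUNSTATS package is installed, and that each query is executed so that all its rows are fetched):

exec runstats_pkg.rs_start;
-- run the MODEL query here, fetching every row
exec runstats_pkg.rs_middle;
-- run the MATCH_RECOGNIZE query here, fetching every row
exec runstats_pkg.rs_stop(1000);  -- report differences above the threshold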


Analytic functions let us work with data in other rows than the current row, but we cannot access the results of analytic functions in other rows. We can use the MODEL clause when analytics are not enough: the key is to use ROW_NUMBER() in the DIMENSION clause to assign sequential subscripts to the data and to order the processing. The MATCH_RECOGNIZE clause orders both data and processing and lets us use the results of prior processing, especially to start a new “match” based on where the previous match ended.