12c MATCH_RECOGNIZE: Grouping sequences

The Tabibitosan method by Aketi Jyuuzou is a very clever and efficient way to group rows with consecutive values. When it solves the problem, it can’t be beat — unless you have 12c. This method is worth explaining in its own right, so I’ll do my best; then I’ll make it easier with the MATCH_RECOGNIZE clause.

A bit of math

Let’s say there are two consecutive integers, A and B, where A = B – 1.

There are two other non-consecutive integers, M and N, so M != N – 1

Now subtract 1 from A and 2 from B. The answer should be the same, since
B – 2 = (B – 1) – 1 = A – 1
(Remember, (B – 1) is equal to A since they are consecutive.)

On the other hand, if we subtract 1 from M and 2 from N, the answer will be different.

Generally, if we subtract one set of consecutive numbers from another set, the answer will always be the same, whereas if we subtract consecutive numbers from non-consecutive numbers the answers will be different. In other words, by subtracting consecutive numbers we have turned “consecutive” into equality and “non-consecutive” into inequality.

A simple example

Here is the data from Aketi’s first example:

create table Ex1 (NumVal) as
select  1 from dual union all
select  2 from dual union all
select  3 from dual union all
select  5 from dual union all
select  6 from dual union all
select  7 from dual union all
select 10 from dual union all
select 11 from dual union all
select 12 from dual union all
SELECT 20 FROM dual;

Aketi uses ROW_NUMBER() to get known consecutive numbers, then he does the subtraction:

SELECT NumVal,
Row_Number() over(order by NumVal) as rn,
NumVal-Row_Number() over(order by NumVal) as "GRP=NUMVAL-RN"
FROM Ex1;
NUMVAL RN GRP=NUMVAL-RN
1 1 0
2 2 0
3 3 0
5 4 1
6 5 1
7 6 1
10 7 3
11 8 3
12 9 3
20 10 10

The consecutive rows now have the same value for GRP. It doesn’t matter what that value is, because all we want to do is group by it. Now here’s the condensed version of the data, with the first and last value and the number of rows in each group.

select min(NumVal) firstval, max(NumVal) lastval, count(*) cnt
FROM (
  SELECT NumVal,
  NumVal-Row_Number() over(order by NumVal) as grp
  FROM Ex1
)
GROUP BY grp
ORDER BY MIN(NumVal);
FIRSTVAL LASTVAL CNT
1 3 3
5 7 3
10 12 3
20 20 1

You can see how efficient this solution is: all it takes is one inline view with an analytic function to get the GRP value, then we do the GROUP BY and we’re done!

Using MATCH_RECOGNIZE

Let’s think of this as a pattern matching problem. The pattern is “consecutive rows”, which we can translate as “the first row, then every row where NumVal is 1 more than the previous NumVal”.

SELECT * FROM ex1
MATCH_RECOGNIZE (
  ORDER BY numval
  MEASURES
    FIRST(numval) firstval,
    LAST(numval) lastval,
    COUNT(*) cnt
  PATTERN (A b*)
  DEFINE b as numval = prev(numval)+1
);

I introduced the basic syntax in a recent post, so I’ll just mention the new stuff.

  • Here the pattern has two parts, A and B*.
    • A all alone means “match exactly one row”. Since A is not defined, the definition defaults to TRUE, meaning match one row no matter what.
    • B* means take any and all following rows that fit the definition. The only difference between the ‘+‘ I used before and the ‘*‘ is that ‘+‘ means “at least one row” while ‘*‘ means “zero or more rows”.
  • When I define B, I use PREV to refer to the previous row.

The last row, with NumVal = 20, is a match where I found one A and no B. If I had said ‘B+’ that row would be unmatched.

An example with more syntax

An AskTom reader wanted to “partition the range of sequential numbers”. Unlike our first example, he asked to output every line, showing the range on the last line of each match. Here is the test data and expected output.

create table RAND(N) as
select 1 from dual union all
select 2 from dual union all
select 3 from dual union all
select 5 from dual union all
select 7 from dual union all
select 8 from dual union all
select 9 from dual union all
select 11 from dual;
N FIRSTN LASTN
1
2
3 1 3
5 5
7
8
9 7 9
11 11

To get this result, we have to change MEASURES (for the output columns) and ROWS PER MATCH (for the output rows).

SELECT * FROM rand
MATCH_RECOGNIZE (
  ORDER BY n
  MEASURES
    CASE WHEN n = FINAL LAST(n) THEN a.n END firstn,
    CASE WHEN n = FINAL LAST(n) and n != a.n THEN n END lastn
  ALL ROWS PER MATCH
  PATTERN (a b*)
  DEFINE b AS n = prev(n)+1
);
  • ALL ROWS PER MATCH will output every row that is involved in a match. Unmatched rows are not returned. (Don’t worry, there is an option to get unmatched rows if you ever need it.)
  • Strangely enough, When you ask for ALL ROWS, you get more columns too! Notice I get column N in the output without asking for it in the MEASURES clause.
  • In the MEASURES clause, “N” means the value of N for the current row. “LAST(N)” means the same thing, since the “last” row is actually the last row so far. If you want to look ahead to the very last row of the match, you need to say FINAL LAST.
  • Notice I say “a.n” instead of FIRST(n). I just wanted to show that the pattern identifiers can be used to qualify rows. “a.n” means the most recent row mapped to A, which is also the first row of the match.

Is MATCH_RECOGNIZE “better” than Tabibitosan?

Tabibitosan is, in my opinion, the most efficient pre-12c pattern matching method, since it finds the matches with one SELECT using analytic functions. Unfortunately, it solves a limited range of problems and it isn’t that easy to understand and apply properly. MATCH_RECOGNIZE seems easier to use and, as the next post will show, it also solves problems when Tabibitosan is not quite enough.

Advertisements

2 thoughts on “12c MATCH_RECOGNIZE: Grouping sequences

    • Hi Vijay and thanks for your comment.

      I think there are two steps in solving this kind of problem: first, find an algorithm that solves the problem, and second, implement the algorithm in SQL. Many times in blogs or forums you see the SQL, but no explanation of what algorithm it implements.

      In this case I tried to implement Kadane’s algorithm. I don’t see any way to do this with one single MATCH_RECOGNIZE clause, because the clause doesn’t give you access to the results of previous matches.

      Here is my attempt to implement the algorithm using the MODEL clause:

      with data as (
        select column_value x 
        from table(sys.odcinumberlist(-2, 1, -3, 4, -1, 2, 1, -5, 4))
      )
      select max(max_so_far) max_sum from (
        select * from data
        model
        dimension by (rownum rn)
        measures (x, greatest(x, 0) max_ending_here, greatest(x,0) max_so_far)
        rules (
          max_ending_here[rn>1] order by rn = greatest(0, max_ending_here[cv()-1] + x[cv()]),
          max_so_far[rn>1] order by rn = greatest(max_ending_here[cv()], max_so_far[cv()-1])
        )
      );

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s