Database 12c MATCH_RECOGNIZE: for all sizes of data

Oracle has been marketing the new MATCH_RECOGNIZE clause as a pattern matching solution for “Big Data”. I would add that it is a general-purpose tool that belongs in the toolbox of every SQL programmer. My next several blog posts will compare MATCH_RECOGNIZE to existing techniques; you will see how simple and straightforward it is to use once you get the hang of it.

As an introduction, here is a question from the AskTom site: Ask Tom: Grouping Ranges. Tom actually made this question into an Oracle Magazine article: On Unlearning and Modeling (scroll down to “Grouping Ranges”). The requirement is to order the data by STUDY_SITE and calculate a running total, except that the running total cannot exceed 65,000. If the next row makes us go over 65,000, we need to start over with a new “group”. Here is the data from the original question (see the next page for the CREATE TABLE and INSERT statements):

STUDY_SITE CNT
1001 3407
1002 4323
1004 1623
1008 1991
1011 885
1012 11597
1014 1989
1015 5282
1017 2841
1018 5183
1020 6176
1022 2784
1023 25865
1024 3734
1026 137
1028 6005
1029 76
1031 4599
1032 1989
1034 3427
1036 879
1038 6485
1039 3
1040 1105
1041 6460
1042 968
1044 471
1045 3360

The desired output is the first and last STUDY_SITE for each group, and the running total:

FIRST_SITE LAST_SITE SUM_CNT
1001 1022 48081
1023 1044 62203
1045 1045 3360

Tom answered this question using a pipelined table function in PL/SQL. This prompted volunteers to post several answers in SQL:

  • there were a handful of attempts using analytic functions, none of which worked;
  • plus two solutions by Jichao Li, using either the MODEL clause or recursive subquery factoring. In his article, Tom recommended the MODEL solution as the most efficient.

You can tell something is hard to do in SQL when Tom Kyte uses PL/SQL and his readers come up with all kinds of solutions, half of which are wrong.

Think “row pattern matching”

Before looking at code, let’s try to think of this problem in terms of what Oracle calls row pattern matching. In plain English, our pattern is “as many consecutive rows as possible, as long as the running total of CNT does not exceed 65,000″. Each set of rows that fits this pattern will be called a match. We want to:

  1. Order the rows by STUDY_SITE;
  2. Starting from the first row, gather as many consecutive rows as we can, as long as the running total of CNT does not exceed 65,000. This set of rows is our first match.
  3. Now we skip past the last row of the preceding match and we start over, looking for the next set of rows that fit the same pattern.
  4. From each match, return one row with three columns: the STUDY_SITE from the first row, the STUDY_SITE from the last row, and the total CNT.

The syntax of the MATCH_RECOGNIZE clause is practically the same as the language I just used, except for two things:

  • First, it uses the MEASURES keyword to identify the list of output columns;
  • Next, it uses the PATTERN and DEFINE keywords to define the pattern.

The code (at last)

SELECT * FROM t
MATCH_RECOGNIZE (
  ORDER BY study_site
  MEASURES
    FIRST(study_site) first_site,
    LAST(study_site) last_site,
    SUM(cnt) sum_cnt
  ONE ROW PER MATCH              -- this is the default
  AFTER MATCH SKIP PAST LAST ROW -- this is the default
  PATTERN (A+)
  DEFINE A AS SUM(cnt) <= 65000
);

I hope you agree that there is not that much difference between my “plain English” description and this code. The only part that should need further explanation is the PATTERN clause: it is a series of identifiers that follow a “regular expression” syntax. A normal regular expression is a series of characters that you want to find in a text string. If you searched for ‘A+’ you would match one or more consecutive uppercase A’s. Here we are not searching for characters, but rows, so A is just an identifier or placeholder for a rule that we DEFINE. In this case, the rule is that the running total cannot exceed 65,000.

Compare this code to the MODEL solution and decide which one is easier to understand:

SELECT s, MAX(e), MAX(sm) FROM (
    SELECT s, e, cnt, sm FROM t
     MODEL DIMENSION BY(row_number() over(order by study_site) rn)
           MEASURES(study_site s, study_site e, cnt, cnt sm)
           RULES(sm[rn > 1] =
                   CASE WHEN (sm[cv() - 1] + cnt[cv()]) > 65000 OR cnt[cv()] > 65000
                        THEN cnt[cv()]
                        ELSE sm[cv() - 1] + cnt[cv()]
                    END,
                 s[rn > 1] =
                   CASE WHEN(sm[cv() - 1] + cnt[cv()]) > 65000 OR cnt[cv()] > 65000
                        THEN s[cv()]
                        ELSE s[cv() - 1]
                    END))
  GROUP BY s ORDER BY s; 

What next?

After this introduction, there is much to explain about the syntax and the possibilities of the MATCH_RECOGNIZE clause. My main goal is to show you that this clause belongs in your SQL toolkit, so I’ll continue to compare it to some of the best pre-12c solutions.

(The next page contains the CREATE TABLE and INSERT statements for the test data.)

About these ads

2 thoughts on “Database 12c MATCH_RECOGNIZE: for all sizes of data

  1. Great articles on this new 12c feature. I am sure a lot of people will find these posts really helpful as they explore all the possibilities that MATCH_RECOGNIZE opens up.

    Keith Laker | Senior Principal Product Manager | Data Warehousing and SQL Analytics
    Oracle UK Scotscroft, Towers Business Park, Wilmslow Road, Didsbury. M20 2RY

  2. Stew,

    Would you be interested in doing a joint presentation on SQL pattern matching at this year’s OOW conference? You can contact me directly at keith.laker@oracle.com.

    Keith Laker | Senior Principal Product Manager | Data Warehousing and SQL Analytics
    Oracle UK Scotscroft, Towers Business Park, Wilmslow Road, Didsbury. M20 2RY

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s