Oracle has been marketing the new
MATCH_RECOGNIZE clause as a pattern matching solution for “Big Data”. I would add that it is a general-purpose tool that belongs in the toolbox of every SQL programmer. My next several blog posts will compare
MATCH_RECOGNIZE to existing techniques; you will see how simple and straightforward it is to use once you get the hang of it.
As an introduction, here is a question from the AskTom site: Ask Tom: Grouping Ranges. Tom actually made this question into an Oracle Magazine article: On Unlearning and Modeling (scroll down to “Grouping Ranges”). The requirement is to order the data by STUDY_SITE and calculate a running total, except that the running total cannot exceed 65,000. If the next row makes us go over 65,000, we need to start over with a new “group”. Here is the data from the original question (see the next page for the CREATE TABLE and INSERT statements):
The desired output is the first and last STUDY_SITE for each group, and the running total:
Tom answered this question using a pipelined table function in PL/SQL. This prompted volunteers to post several answers in SQL:
- there were a handful of attempts using analytic functions, none of which worked;
- plus two solutions by Jichao Li, using either the
MODELclause or recursive subquery factoring. In his article, Tom recommended the
MODELsolution as the most efficient.
You can tell something is hard to do in SQL when Tom Kyte uses PL/SQL and his readers come up with all kinds of solutions, half of which are wrong.
Think “row pattern matching”
Before looking at code, let’s try to think of this problem in terms of what Oracle calls row pattern matching. In plain English, our pattern is “as many consecutive rows as possible, as long as the running total of CNT does not exceed 65,000″. Each set of rows that fits this pattern will be called a match. We want to:
- Order the rows by STUDY_SITE;
- Starting from the first row, gather as many consecutive rows as we can, as long as the running total of CNT does not exceed 65,000. This set of rows is our first match.
- Now we skip past the last row of the preceding match and we start over, looking for the next set of rows that fit the same pattern.
- From each match, return one row with three columns: the STUDY_SITE from the first row, the STUDY_SITE from the last row, and the total CNT.
The syntax of the
MATCH_RECOGNIZE clause is practically the same as the language I just used, except for two things:
- First, it uses the
MEASURESkeyword to identify the list of output columns;
- Next, it uses the
DEFINEkeywords to define the pattern.
The code (at last)
SELECT * FROM t MATCH_RECOGNIZE ( ORDER BY study_site MEASURES FIRST(study_site) first_site, LAST(study_site) last_site, SUM(cnt) sum_cnt ONE ROW PER MATCH -- this is the default AFTER MATCH SKIP PAST LAST ROW -- this is the default PATTERN (A+) DEFINE A AS SUM(cnt) <= 65000 );
I hope you agree that there is not that much difference between my “plain English” description and this code. The only part that should need further explanation is the
PATTERN clause: it is a series of identifiers that follow a “regular expression” syntax. A normal regular expression is a series of characters that you want to find in a text string. If you searched for ‘A+’ you would match one or more consecutive uppercase A’s. Here we are not searching for characters, but rows, so A is just an identifier or placeholder for a rule that we
DEFINE. In this case, the rule is that the running total cannot exceed 65,000.
Compare this code to the
MODEL solution and decide which one is easier to understand:
SELECT s, MAX(e), MAX(sm) FROM ( SELECT s, e, cnt, sm FROM t MODEL DIMENSION BY(row_number() over(order by study_site) rn) MEASURES(study_site s, study_site e, cnt, cnt sm) RULES(sm[rn > 1] = CASE WHEN (sm[cv() - 1] + cnt[cv()]) > 65000 OR cnt[cv()] > 65000 THEN cnt[cv()] ELSE sm[cv() - 1] + cnt[cv()] END, s[rn > 1] = CASE WHEN(sm[cv() - 1] + cnt[cv()]) > 65000 OR cnt[cv()] > 65000 THEN s[cv()] ELSE s[cv() - 1] END)) GROUP BY s ORDER BY s;
After this introduction, there is much to explain about the syntax and the possibilities of the
MATCH_RECOGNIZE clause. My main goal is to show you that this clause belongs in your SQL toolkit, so I’ll continue to compare it to some of the best pre-12c solutions.
(The next page contains the CREATE TABLE and INSERT statements for the test data.)