MATCH_RECOGNIZE Restrictions

Sometimes we think that certain restrictions are not documented when in fact they are. Where do we forget to look? In the database error messages.

As mentioned recently on asktom.oracle.com (thanks, Connor), all error messages and event codes are listed in $ORACLE_HOME/rdbms/mesg/oraus.msg. Searching for “pattern matching”, we read “62500 – 62549 are reserved for PATTERN MATCHING”.

So here is a link to all the MATCH_RECOGNIZE error messages: ORA-60001 to ORA-65535

As of version 19c, the range actually used is 62500 through 62521.

It was a surprise to discover that “bounded quantifiers” are limited to 200, but there it is in black and white:

“ORA-62518: MATCH_RECOGNIZE pattern is too large.”
“Permute with more than 7 elements or bounded quantifier bound greater than 200 are … currently not supported.”

I’m not suggesting we spend our lives reading database error messages, but we have to admit they are documented ;-)

Best regards,
Stew

P.S. Where are the values of UB2MAXVAL and UB4MAXVAL documented? If you know, please post a link in the comments below. I suspect UB2MAXVAL is 0XFFFF and UB4MAXVAL is 0XFFFFFFFF.

FOOTNOTE 2019-11-05 : thanks to Maxim Demenko for pointing out where UB*MAXVAL are documented. In the OCI Programmer’s Guide, at the very end of the chapter on Data Types, it says

“Throughout this guide there are references to data types like ub2 or sb4, or to constants like UB4MAXVAL. These types are defined in the oratypes.h header file, which is found in the public directory. The exact contents may vary according to the operating system that you are using.”

Backtracking

By popular request, here are my thoughts about the impact of “backtracking” on performance when using the MATCH_RECOGNIZE clause. This came up again because of a query that Jonathan Lewis wrote recently; however, I will concentrate on the theory, not the query.

Patterns

The match_recognize clause implements row pattern matching: it recognizes that a consecutive series of rows is a match for a defined pattern.

The pattern is described in the aptly named PATTERN clause. The syntax resembles a subset of the regular expression syntax. For example:

  • a regular expression pattern ‘AB’ means “find the character A immediately followed by the character B”.
  • In MATCH_RECOGNIZE, PATTERN(A B) means “find a row that meets condition A, immediately followed by a row that meets condition B”. The conditions are then described in the DEFINE clause.

Both syntaxes use two features that can lead to backtracking: quantifiers and alternation.

Quantifiers

PATTERN(A) means find exactly one row that meets condition A.

We can be more flexible in how many A rows we want:

  • A? means we want 0 or 1 A row;
  • A* means we want 0 or more rows;
  • A+ means we want 1 or more rows;
  • A{100} means we want exactly 100 rows;
  • A{3,100} means we want from 3 to 100 rows.

Notice the word “exactly” appears only once in this list. All the other quantifiers are what I’ll call indefinite: there can be more than one series of rows that match the pattern! Suppose we have 200 consecutive rows that meet condition A: the pattern A* could be met 201 different ways.

When a quantifier is indefinite, the rule is to match as many rows as possible: the quantifiers are greedy. If we add a question mark, the rule is to match as few rows as possible: the quantifier becomes reluctant.

  • A{3,100} will match 100 rows if it can.
  • A{3,100}? will match 3 rows and then stop.
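Since the PATTERN syntax resembles regular expression syntax, greedy versus reluctant behavior can be tried out with any regex engine. Here is a quick sketch using Python's re module as an analogy (not Oracle's engine, of course), with a string of As standing in for rows that meet condition A:

```python
import re

s = 'A' * 200  # stands in for 200 rows that all meet condition A

# Greedy: match as many as the quantifier allows.
assert len(re.match(r'A{3,100}', s).group(0)) == 100

# Reluctant (add '?'): match as few as the quantifier allows.
assert len(re.match(r'A{3,100}?', s).group(0)) == 3

# A* is greedy too: it takes all 200.
assert len(re.match(r'A*', s).group(0)) == 200
```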

Whether greedy or reluctant, indefinite quantifiers can lead to backtracking. More in a minute.

Alternation

To quote the documentation, an “alternation list is created by placing a vertical bar (|) between each regular expression. Alternatives are preferred in the order they are specified. As an example, PATTERN (A | B | C) attempts to match A first. If A is not matched, it attempts to match B. If B is not matched, it attempts to match C.”

Alternation is also indefinite: in this example, the number of rows is always the same, but the same row might meet any of three different conditions.
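The regex analogy holds here too: alternatives are tried in the order written, so the same input can match differently depending on that order. A Python sketch:

```python
import re

# Alternatives are preferred in the order they are specified.
assert re.match(r'(?:ab|a)', 'ab').group(0) == 'ab'  # 'ab' is tried first
assert re.match(r'(?:a|ab)', 'ab').group(0) == 'a'   # now 'a' is tried first and wins
```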

From now on I’ll concentrate on quantifiers, since they are much more common.

From “indefinite” to backtracking

Even though an indefinite quantifier means there is more than one correct answer, there is always one preferred answer, so what’s the big deal? Suppose there are 200 rows that meet the A condition:

  • A{1,100} returns 100 rows and
  • A{1,100}? returns 1 row.

Aha! But what if there is another condition after A?

With PATTERN(A{1,100} B), suppose there are 101 consecutive rows that meet A but not B.
A regular expression engine should find 100 As, then fail to find a B.
It will then backtrack, “giving back” the 100th A. It will again fail to find a B.
It will then backtrack, “giving back” the 99th A. It will again fail to find a B.
And so on all the way back to 1.

With PATTERN(A{1,100}? B), suppose there are 100 consecutive As followed by a B.
The engine should find 1 A, then fail to find a B.
It will then backtrack, adding a 2nd A. It will again fail to find a B.
And so on all the way up to 100, where it finally finds the B.

So backtracking does not mean “giving back” an A, it means backing up from B to A.

To summarize: backtracking can happen when an indefinite quantifier is followed by another condition. With greedy quantifiers, the worst backtracking happens when there is no match, because every possible solution must be tested. With reluctant quantifiers, backtracking may happen even when there is eventually a match.
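To make those attempt counts concrete, here is a toy backtracking matcher for the single pattern A{lo,hi} B. This is my own Python sketch of the idea, not Oracle's state machine; it counts how many times the engine has to go looking for a B:

```python
def match_pattern(rows, lo, hi, reluctant=False):
    """Try to match A{lo,hi} B at the start of `rows` (a string of row labels).

    Returns (number_of_A_rows_matched or None, attempts), where `attempts`
    counts how many candidate A-run lengths were tried before looking for B.
    """
    # Greedy tries the longest A-run first; reluctant tries the shortest first.
    order = range(hi, lo - 1, -1) if not reluctant else range(lo, hi + 1)
    attempts = 0
    for n in order:
        if n > len(rows) or any(c != 'A' for c in rows[:n]):
            continue                      # can't take n A-rows here at all
        attempts += 1                     # we now look for a B after n A-rows
        if n < len(rows) and rows[n] == 'B':
            return n, attempts            # matched A{n} followed by B
    return None, attempts                 # no split point worked: no match

# 101 A-rows and no B: greedy fails after trying every length from 100 down to 1.
assert match_pattern('A' * 101, 1, 100) == (None, 100)
# 100 A-rows then a B: reluctant works its way up through 100 attempts.
assert match_pattern('A' * 100 + 'B', 1, 100, reluctant=True) == (100, 100)
# When the greedy match succeeds right away, it is cheap: one attempt here.
assert match_pattern('AAAB', 1, 100) == (3, 1)
```

The two failing scenarios reproduce the walkthroughs above: greedy backtracking is worst when there is no match, reluctant backtracking happens even on the way to a match.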

Instrumentation?

There is one bit of instrumentation about backtracking in the explain plan. Here is a quote from Oracle development that Keith Laker sent me five years ago:

In the explain plan the specific pattern matching keywords are: MATCH RECOGNIZE (SORT | BUFFER) [DETERMINISTIC FINITE AUTO]

When the plan shows “MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO”, “MATCH RECOGNIZE” refers to the row source for evaluating the match_recognize clause, “SORT” means the row source sorts the data before running it through the state machine to find matches, and “DETERMINISTIC FINITE AUTO” means the state machine we constructed is deterministic, so when running the sorted rows through it we don’t do backtracking. “DETERMINISTIC FINITE AUTO” is desirable because execution is efficient when there is no backtracking.

Currently we detect a deterministic finite automaton by checking the state machine built: if any state has 2 or more outgoing transitions, or if any final state is followed by a non-final state, then we regard the state machine as non-deterministic. We don’t check the predicates associated with each transition at all. At the moment we can only detect a few trivial cases such as PATTERN (A B C), PATTERN (A B+), PATTERN (A B*), etc.

For PATTERN (A | B) , or PATTERN (A B+ C) we just regard the state machine as non-deterministic. We don’t check the mutual exclusiveness of the define predicates in detecting a deterministic state machine.

Conclusions

The quote from Oracle development confirms that alternation, and indefinite quantifiers followed by another condition, may be subject to backtracking. If we are lucky enough to see DETERMINISTIC FINITE AUTO, we know backtracking is not a problem. In testing, we should always test situations where no match is found. If there are reluctant quantifiers, we should also test situations where there is a match, but not right away.

Finally, each condition should be defined as strictly as possible, saying what it should be and also what it should not be. More than once, I have run into backtracking problems because the first condition was always true; once I defined the first condition more strictly, potential matches were eliminated much earlier and the query sped up considerably.

Hope this helps,
Stew

Techniques for Comparing Tables

In my “Advanced Row Pattern Matching” presentation, I demonstrate using MATCH_RECOGNIZE to compare tables. Kim Berg Hansen asked me to compare this technique with others. I did some quick tests and here are the results with some comments.

Technique          Seconds
-----------------  -------
Full join              1
Group by (HASH)        1
Group by (SORT)        1.4
Analytic function      2.5
MATCH_RECOGNIZE        3.7

The “Full join” technique only works when we have a primary or unique key that is shared by both tables. I prefer the GROUP BY technique popularized by Tom Kyte, even though it may be a bit slower. When testing, I noticed that the HASH GROUP BY algorithm performs better than SORT GROUP BY, as others have written.

If either of the tables contains duplicate rows (which may happen if we don’t compare all of the columns, or if there is no primary key), then GROUP BY will output just one row per group of duplicates. This may be a problem if we want data (such as the ROWID) that was not included in the comparison. In that case, we can use analytic functions or the MATCH_RECOGNIZE clause to compare and output all the rows and columns of interest. As you can see, the analytic function is more than twice as slow as the full join, but it easily beats the MATCH_RECOGNIZE clause.
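The counting idea behind the GROUP BY technique is easy to sketch outside SQL. Here is a Python analogy using multisets (the function name and output format are my own): group on all columns, keep the groups whose counts differ between the two tables.

```python
from collections import Counter

def compare_tables(old_rows, new_rows):
    """Compare two tables as multisets of row tuples, like the GROUP BY
    counting technique: one output line per distinct row whose counts differ."""
    old_c, new_c = Counter(old_rows), Counter(new_rows)
    diff = []
    for row in (old_c | new_c):           # every distinct row from either side
        delta = new_c[row] - old_c[row]
        if delta != 0:
            diff.append((row, 'insert' if delta > 0 else 'delete', abs(delta)))
    return sorted(diff)

old = [('x', 1), ('y', 2), ('y', 2)]
new = [('x', 1), ('y', 2), ('z', 3)]
assert compare_tables(old, new) == [(('y', 2), 'delete', 1),
                                    (('z', 3), 'insert', 1)]
```

Note that, like GROUP BY, this collapses duplicates to one output line per distinct row: the ('y', 2) duplicate shows up only as a count, which is exactly the limitation discussed above when we also want the ROWID.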

I use the output from table comparisons to synchronize the tables, so capturing the ROWID is important to me even when a primary or unique key is not available. For that use case, I will prefer analytic functions from now on.

#OUGN17 Row Pattern Matching with MATCH_RECOGNIZE

OUGN17 (the boat conference) rocked – and the boat didn’t, thank goodness. I gave two talks about row pattern matching that were well received: two or three brave souls even went to both talks.

I put my slides on the OUGN Google Drive page and on Slideshare. I put them in “Powerpoint Show” format, so please let me know if you have any problems with that format.

For those who went to the “advanced” session, I changed a few slides near the end to better illustrate my way of doing “range matches”.

Please download the slides to get the animations!

Thanks to OUGN and to everyone I shared the fun with.

Best regards, Stew

P.S. Here are the Slideshare links. To download, start by clicking on “in SlideShare” in the bottom right corner.

#DOAG2016: Ranges, Ranges Everywhere!

Today I did my second presentation at DOAG2016. It was at 9:00 so I got to sleep in ;)

The room was huge but there were enough people that I didn’t feel too lonely.

The room and the technical help were top notch, and again there were questions at just the right time to remind me of things I might have left out!

[Update 2016-11-20: there was a bug on slide 21! Instead of “order by end_n” it should be “order by start_n”. I have updated the slide on the DOAG website and on slideshare. My test cases were not thorough; my apologies…]

As promised, I put the presentation on Slideshare. Here is the link:

I will be blogging about some of the content later on, so stay tuned…

As always, please download the slides and play them in Powerpoint so you can see the animations.

Thanks to DOAG for the invite!

#DOAG2016: Advanced Row Pattern Matching

DOAG2016 started today at 8:30, and I did too. There were so many great presentations at the same time as mine, I was surprised and pleased to get a nice audience.

The room and the technical help were top notch, and the questions came at just the right time to remind me of things I might have left out!

As promised, I put the presentation on Slideshare. Here is the link:

http://www.slideshare.net/StewAshton1/advanced-row-pattern-matching

I will be blogging about some of the content later on, so stay tuned…

If you want to start with an “unadvanced” presentation, go here first:

http://www.slideshare.net/StewAshton1/row-pattern-matching-in-oracle-database-12c

As always, please download the slides and play them in Powerpoint so you can see the animations.

Thanks to DOAG for the invite!

Kim Berg Hansen on “Use Cases of Row Pattern Matching in Oracle 12c”

As I write this, I am listening to Kim Berg Hansen explain the MATCH_RECOGNIZE clause. He was kind enough to give me credit for some of the examples and mention this blog.

In addition to my blog posts on the subject, you may enjoy my presentation on SlideShare. Please download it to see the animations!

If you have questions arising from the presentation, please add a comment here.

Bravo to Kim for his remarkably interactive webinar!

Joining Temporal Tables 3: Gaps

This is my third post about temporal tables that don’t allow overlaps. I’m going to add support for temporal gaps, which will also allow objects to stop existing at some point.

Test Data

I’m going to add a NOT NULL numeric field called START_GAP, which can be 0 or 1. When START_GAP = 0, the row contains data that is valid starting at FROM_DATE. When START_GAP = 1, then FROM_DATE is actually the end date of the previous row.

The data can now contain all the relations between time intervals that I listed previously.

alter session set nls_date_format='dd';
drop table a purge;

create table a (
  case_id varchar2(32),
  from_date date,
  start_gap number(1,0) not null check(start_gap in (0,1)),
  a_string varchar2(32),
  primary key(case_id, from_date)
);
Insert into A values ('01-precedes',to_date('2016-01-01','yyyy-mm-dd'),0,'A01-precedes');
Insert into A values ('01-precedes',to_date('2016-01-02','yyyy-mm-dd'),1,'A01-precedes');
Insert into A values ('02-meets',to_date('2016-01-01','yyyy-mm-dd'),0,'A02-meets');
Insert into A values ('02-meets',to_date('2016-01-02','yyyy-mm-dd'),1,'A02-meets');
Insert into A values ('03-overlaps',to_date('2016-01-01','yyyy-mm-dd'),0,'A03-overlaps');
Insert into A values ('03-overlaps',to_date('2016-01-03','yyyy-mm-dd'),1,'A03-overlaps');
Insert into A values ('04-finished by',to_date('2016-01-01','yyyy-mm-dd'),0,'A04-finished by');
Insert into A values ('04-finished by',to_date('2016-01-03','yyyy-mm-dd'),1,'A04-finished by');
Insert into A values ('05-contains',to_date('2016-01-01','yyyy-mm-dd'),0,'A05-contains');
Insert into A values ('05-contains',to_date('2016-01-04','yyyy-mm-dd'),1,'A05-contains');
Insert into A values ('06-starts',to_date('2016-01-01','yyyy-mm-dd'),0,'A06-starts');
Insert into A values ('06-starts',to_date('2016-01-02','yyyy-mm-dd'),1,'A06-starts');
Insert into A values ('07-equals',to_date('2016-01-01','yyyy-mm-dd'),0,'A07-equals');
Insert into A values ('07-equals',to_date('2016-01-02','yyyy-mm-dd'),1,'A07-equals');

drop table b purge;

create table b (
  case_id varchar2(32),
  from_date date,
  start_gap number(1,0) not null check(start_gap in (0,1)),
  b_string varchar2(32),
  primary key(case_id, from_date)
);

Insert into B values ('01-precedes',to_date('2016-01-03','yyyy-mm-dd'),0,'B01-precedes');
Insert into B values ('01-precedes',to_date('2016-01-04','yyyy-mm-dd'),1,'B01-precedes');
Insert into B values ('02-meets',to_date('2016-01-02','yyyy-mm-dd'),0,'B02-meets');
Insert into B values ('02-meets',to_date('2016-01-03','yyyy-mm-dd'),1,'B02-meets');
Insert into B values ('03-overlaps',to_date('2016-01-02','yyyy-mm-dd'),0,'B03-overlaps');
Insert into B values ('03-overlaps',to_date('2016-01-04','yyyy-mm-dd'),1,'B03-overlaps');
Insert into B values ('04-finished by',to_date('2016-01-02','yyyy-mm-dd'),0,'B04-finished by');
Insert into B values ('04-finished by',to_date('2016-01-03','yyyy-mm-dd'),1,'B04-finished by');
Insert into B values ('05-contains',to_date('2016-01-02','yyyy-mm-dd'),0,'B05-contains');
Insert into B values ('05-contains',to_date('2016-01-03','yyyy-mm-dd'),1,'B05-contains');
Insert into B values ('06-starts',to_date('2016-01-01','yyyy-mm-dd'),0,'B06-starts');
Insert into B values ('06-starts',to_date('2016-01-03','yyyy-mm-dd'),1,'B06-starts');
Insert into B values ('07-equals',to_date('2016-01-01','yyyy-mm-dd'),0,'B07-equals');
Insert into B values ('07-equals',to_date('2016-01-02','yyyy-mm-dd'),1,'B07-equals');
commit;

select * from a
union all
select * from b
order by 1,2,4;
CASE_ID FROM_DATE START_GAP A_STRING or B_STRING
01-precedes 01 0 A01-precedes
01-precedes 02 1 A01-precedes
01-precedes 03 0 B01-precedes
01-precedes 04 1 B01-precedes
02-meets 01 0 A02-meets
02-meets 02 1 A02-meets
02-meets 02 0 B02-meets
02-meets 03 1 B02-meets
03-overlaps 01 0 A03-overlaps
03-overlaps 02 0 B03-overlaps
03-overlaps 03 1 A03-overlaps
03-overlaps 04 1 B03-overlaps
04-finished by 01 0 A04-finished by
04-finished by 02 0 B04-finished by
04-finished by 03 1 A04-finished by
04-finished by 03 1 B04-finished by
05-contains 01 0 A05-contains
05-contains 02 0 B05-contains
05-contains 03 1 B05-contains
05-contains 04 1 A05-contains
06-starts 01 0 A06-starts
06-starts 01 0 B06-starts
06-starts 02 1 A06-starts
06-starts 03 1 B06-starts
07-equals 01 0 A07-equals
07-equals 01 0 B07-equals
07-equals 02 1 A07-equals
07-equals 02 1 B07-equals

What’s the idea?

As before, we need to carry down values and eliminate certain rows. How do we manage that for the new rows where START_GAP = 1?

  1. For table A, if the most recent START_GAP is 1 then A_STRING should be NULL. Same for B.
  2. If the most recent START_GAP is 1 for both tables, then that row should be eliminated.
  3. If the first row is from A, then B.START_GAP will be NULL. In that case B.START_GAP should be treated as 1, since the B object did not exist at that point.
  4. As before, TO_DATE must be NULL or greater than FROM_DATE, otherwise the row is eliminated.
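Rules 1 through 4 can be prototyped outside SQL before writing the query. Here is a Python sketch of the merge (my own helper, with integers standing in for dates and None for NULL):

```python
def join_temporal(a_rows, b_rows):
    """Merge two gap-style temporal tables for one CASE_ID.

    a_rows / b_rows: lists of (from_date, start_gap, string) sorted by date.
    Returns a list of (from_date, to_date, a_string, b_string) periods.
    """
    events = sorted([(d, 'a', g, s) for d, g, s in a_rows] +
                    [(d, 'b', g, s) for d, g, s in b_rows])
    agap = bgap = None                    # rule 3: a missing side counts as a gap
    a_str = b_str = None
    out = []
    for i, (d, side, gap, s) in enumerate(events):
        if side == 'a':                   # carry down the most recent values
            agap, a_str = gap, s
        else:
            bgap, b_str = gap, s
        to_d = events[i + 1][0] if i + 1 < len(events) else None
        if to_d is not None and to_d <= d:
            continue                      # rule 4: zero-length period, drop it
        if (1 if agap is None else agap) + (1 if bgap is None else bgap) >= 2:
            continue                      # rules 2 and 3: neither object exists
        out.append((d, to_d,
                    a_str if agap == 0 else None,   # rule 1
                    b_str if bgap == 0 else None))  # rule 1 again, for B
    return out

# The "03-overlaps" test case from the data above:
a = [(1, 0, 'A03-overlaps'), (3, 1, 'A03-overlaps')]
b = [(2, 0, 'B03-overlaps'), (4, 1, 'B03-overlaps')]
assert join_temporal(a, b) == [(1, 2, 'A03-overlaps', None),
                               (2, 3, 'A03-overlaps', 'B03-overlaps'),
                               (3, 4, None, 'B03-overlaps')]
```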

The MATCH_RECOGNIZE solution

Note that I don’t create a ROWTYPE column anymore: since START_GAP is defined as NOT NULL, I can use A.START_GAP to determine whether the current row comes from table A.

select case_id, from_date, to_date, a_str a_string, b_str b_string
from (
  select case_id, from_date, start_gap agap, a_string,
    null bgap, null b_string
  from a
  union all
  select case_id, from_date, null, null, start_gap bgap, b_string
  from b
)
match_recognize(
  partition by case_id order by from_date
  measures next(from_date) to_date,
    nvl(a.agap,1) + nvl(b.bgap,1) abgap,
    case a.agap when 0 then a.a_string end a_str,
    case b.bgap when 0 then b.b_string end b_str
  all rows per match
  pattern( (a|b)+ )
  define a as agap is not null
)
where (to_date > from_date or to_date is null)
  and abgap < 2;
CASE_ID FROM_DATE TO_DATE A_STRING B_STRING
01-precedes 01 02 A01-precedes
01-precedes 03 04 B01-precedes
02-meets 01 02 A02-meets
02-meets 02 03 B02-meets
03-overlaps 01 02 A03-overlaps
03-overlaps 02 03 A03-overlaps B03-overlaps
03-overlaps 03 04 B03-overlaps
04-finished by 01 02 A04-finished by
04-finished by 02 03 A04-finished by B04-finished by
05-contains 01 02 A05-contains
05-contains 02 03 A05-contains B05-contains
05-contains 03 04 A05-contains
06-starts 01 02 A06-starts B06-starts
06-starts 02 03 B06-starts
07-equals 01 02 A07-equals B07-equals

The Analytic solution

This one took me some time to figure out. I broke it down into 3 logical steps using WITH clauses.

  • ALL_ROWS just does the UNION ALL, while transforming NULL values to CHR(0)
  • ANALYZED_ROWS applies all the analytic functions
  • The main SELECT transforms CHR(0) to NULL, makes a value NULL if the most recent START_GAP = 1, and eliminates excess rows.
with all_rows as (
  select case_id, from_date, start_gap agap, nvl(a_string, chr(0)) a_string,
    null bgap, null b_string
  from a
  union all
  select case_id, from_date, null, null,
    start_gap bgap, nvl(b_string, chr(0)) b_string
  from b
)
, analyzed_rows as (
  select case_id, from_date,
    lead(from_date) over(partition by case_id order by from_date) to_date,
    last_value(agap) ignore nulls
      over(partition by case_id order by from_date) agap,
    last_value(bgap) ignore nulls
      over(partition by case_id order by from_date) bgap,
    last_value(a_string) ignore nulls
      over(partition by case_id order by from_date) a_string,
    last_value(b_string) ignore nulls
      over(partition by case_id order by from_date) b_string
  from all_rows
)
select case_id, from_date, to_date,
  case when agap = 0 then nullif(a_string, chr(0)) end a_string,
  case when bgap = 0 then nullif(b_string, chr(0)) end b_string
from analyzed_rows
where (to_date > from_date or to_date is null)
  and nvl(agap,1) + nvl(bgap,1) < 2;
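The key building block here is LAST_VALUE ... IGNORE NULLS, which carries the most recent non-null value down to the current row. Its behavior is easy to model; here is a Python sketch of the carry-down logic, using None for NULL:

```python
def last_value_ignore_nulls(values):
    """Carry the most recent non-None value down to each position, like
    LAST_VALUE(...) IGNORE NULLS over rows up to the current row."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v                     # remember the latest non-null value
        out.append(last)
    return out

# e.g. the agap column where A rows and B rows alternate:
assert last_value_ignore_nulls([0, None, 1, None]) == [0, 0, 1, 1]
```

This also shows why the query needs the CHR(0) trick: LAST_VALUE cannot distinguish “the carried value is NULL” from “no value yet”, so genuine NULLs are encoded as CHR(0) on the way in and decoded with NULLIF on the way out.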

There’s more?

Yes indeed, I’m not through yet.

At this point I can define temporal tables that allow gaps but not overlaps. The attributes can be NULL but FROM_DATE can’t. The TO_DATE is not stored in the same row, but it can be displayed that way. I can also “join” two such tables in an efficient manner.

I only have one problem: your tables probably aren’t built that way. You have FROM_DATE and TO_DATE in each row, and if you say you don’t have overlapping ranges then you want me to believe you.

In my next post I’ll try to “join” tables that store FROM_DATE and TO_DATE, assuming there are no overlaps.