Backtracking

By popular request, here are my thoughts about the impact of “backtracking” on performance when using the MATCH_RECOGNIZE clause. This came up again because of a query that Jonathan Lewis wrote recently; however, I will concentrate on the theory, not the query.

Patterns

The match_recognize clause implements row pattern matching: it recognizes that a consecutive series of rows is a match for a defined pattern.

The pattern is described in the aptly named PATTERN clause. The syntax resembles a subset of the regular expression syntax. For example:

  • a regular expression pattern ‘AB’ means “find the character A immediately followed by the character B”.
  • In MATCH_RECOGNIZE, PATTERN(A B) means “find a row that meets condition A, immediately followed by a row that meets condition B”. The conditions are then described in the DEFINE clause.

Both syntaxes use two features that can lead to backtracking: quantifiers and alternation.

Quantifiers

PATTERN(A) means find exactly one row that meets condition A.

We can be more flexible in how many A rows we want:

  • A? means we want 0 or 1 A row;
  • A* means we want 0 or more rows;
  • A+ means we want 1 or more rows;
  • A{100} means we want exactly 100 rows;
  • A{3,100} means we want from 3 to 100 rows.

Notice the word “exactly” appears only once in this list. All the other quantifiers are what I’ll call indefinite: there can be more than one series of rows that match the pattern! Suppose we have 200 consecutive rows that meet condition A: the pattern A* could be met 201 different ways.

When a quantifier is indefinite, the rule is to match as many rows as possible: the quantifiers are greedy. If we add a question mark, the rule is to match as few rows as possible: the quantifier becomes reluctant.

  • A{3,100} will match 100 rows if it can.
  • A{3,100}? will match 3 rows and then stop.

Whether greedy or reluctant, indefinite quantifiers can lead to backtracking. More in a minute.

Alternation

To quote the documentation, an “alternation list is created by placing a vertical bar (|) between each regular expression. Alternatives are preferred in the order they are specified. As an example, PATTERN (A | B | C) attempts to match A first. If A is not matched, it attempts to match B. If B is not matched, it attempts to match C.”

Alternation is also indefinite: in this example, the number of rows is always the same, but the same row might meet any of three different conditions.

From now on I’ll concentrate on quantifiers, since they are much more common.

From “indefinite” to backtracking

Even though an indefinite quantifier means there is more than one correct answer, there is always one preferred answer, so what’s the big deal? Suppose there are 200 rows that meet the A condition:

  • A{1,100} returns 100 rows and
  • A{1,100}? returns 1 row.

Aha! But what if there is another condition after A?

With PATTERN(A{1,100} B), suppose there are 101 consecutive rows that meet A but not B.
A regular expression engine should find 100 A, then not find a B.
It will then backtrack, “giving back” A 100. It will then find there is no B.
It will then backtrack, “giving back” A 99. It will then find there is no B.
And so on all the way back to 1.

With PATTERN(A{1,100}? B), suppose there are 100 consecutive As followed by a B.
The engine should find 1 A, then not find a B.
It will then backtrack, adding A 2. It will then not find a B.
And so on all the way up to 100.

So backtracking does not mean “giving back” an A, it means backing up from B to A.

To summarize: backtracking can happen when an indefinite quantifier is followed by another condition. With greedy quantifiers, the worst backtracking happens when there is no match, because every possible solution must be tested. With reluctant quantifiers, backtracking may happen even when there is eventually a match.

Instrumentation?

There is one bit of instrumentation about backtracking in the explain plan. Here is a quote from Oracle development that Keith Laker sent me five years ago:

In the explain plan the specific pattern matching keywords are: MATCH RECOGNIZE (SORT | BUFFER) [DETERMINISTIC FINITE AUTO]

When the plan shows “MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO“, here “MATCH RECOGNIZE” refers to the row source for evaluating the match_recognize clause , “SORT” means the row source uses “SORT” to sort the data before running it through the state machine to find matches, and “DETERMINISTIC FINITE AUTO” means the state machine that we constructed is deterministic and thus when running the sorted rows through the state machine, we don’t do backtracking. DETERMINISTIC FINITE AUTO” is desirable as the execution is efficient when there is no backtracking.

Currently we detect deterministic finite automaton by checking the state machine built: if any state has 2 or more outgoing transitions then we regard the state machine as non-deterministic, if any final state is followed by a non-final state, then the state machine is regarded as non-deterministic. We don’t check the predicates associated with each transition at all. At the moment we can only detect a few trivial cases such as PATTERN (A B C), PATTERN (A B+), PATTERN (A B*), etc.

For PATTERN (A | B) , or PATTERN (A B+ C) we just regard the state machine as non-deterministic. We don’t check the mutual exclusiveness of the define predicates in detecting a deterministic state machine.

Conclusions

The quote from Oracle development confirms that alternation, or indefinite quantifiers followed by another condition, are possibly subject to backtracking. If we are lucky enough to see DETERMINISTIC FINITE AUTO, we know backtracking is not a problem. In testing, we should always test situations where no match is found. If there are reluctant quantifiers, we should also test situations where there is a match, but not right away.

Finally, each condition should be defined as strictly as possible, saying what it should be and also what it should not be. More than once, I have run into backtracking problems because the first condition was always true; once I defined the first condition more strictly, potential matches were eliminated much earlier and the query sped up considerably.

Hope this helps,
Stew

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s