The Tabibitosan method by Aketi Jyuuzou is a very clever and efficient way to group rows with consecutive values. When it solves the problem, it can’t be beat — unless you have 12c. This method is worth explaining in its own right, so I’ll do my best; then I’ll make it easier with the
A bit of math
Let’s say there are two consecutive integers, A and B, where A = B – 1.
There are two other non-consecutive integers, M and N, so M != N – 1
Now subtract 1 from A and 2 from B. The answer should be the same, since
B – 2 = (B – 1) – 1 = A – 1
(Remember, (B – 1) is equal to A since they are consecutive.)
On the other hand, if we subtract 1 from M and 2 from N, the answer will be different.
Generally, if we subtract one set of consecutive numbers from another set, the answer will always be the same, whereas if we subtract consecutive numbers from non-consecutive numbers the answers will be different. In other words, by subtracting consecutive numbers we have turned “consecutive” into equality and “non-consecutive” into inequality.
A simple example
Here is the data from Aketi’s first example:
create table Ex1 (NumVal) as select 1 from dual union all select 2 from dual union all select 3 from dual union all select 5 from dual union all select 6 from dual union all select 7 from dual union all select 10 from dual union all select 11 from dual union all select 12 from dual union all SELECT 20 FROM dual;
ROW_NUMBER() to get known consecutive numbers, then he does the subtraction:
SELECT NumVal, Row_Number() over(order by NumVal) as rn, NumVal-Row_Number() over(order by NumVal) as "GRP=NUMVAL-RN" FROM Ex1;
The consecutive rows now have the same value for GRP. It doesn’t matter what that value is, because all we want to do is group by it. Now here’s the condensed version of the data, with the first and last value and the number of rows in each group.
select min(NumVal) firstval, max(NumVal) lastval, count(*) cnt FROM ( SELECT NumVal, NumVal-Row_Number() over(order by NumVal) as grp FROM Ex1 ) GROUP BY grp ORDER BY MIN(NumVal);
You can see how efficient this solution is: all it takes is one inline view with an analytic function to get the GRP value, then we do the
GROUP BY and we’re done!
Let’s think of this as a pattern matching problem. The pattern is “consecutive rows”, which we can translate as “the first row, then every row where NumVal is 1 more than the previous NumVal”.
SELECT * FROM ex1 MATCH_RECOGNIZE ( ORDER BY numval MEASURES FIRST(numval) firstval, LAST(numval) lastval, COUNT(*) cnt PATTERN (A b*) DEFINE b as numval = prev(numval)+1 );
I introduced the basic syntax in a recent post, so I’ll just mention the new stuff.
- Here the pattern has two parts, A and B*.
- A all alone means “match exactly one row”. Since A is not defined, the definition defaults to TRUE, meaning match one row no matter what.
- B* means take any and all following rows that fit the definition. The only difference between the ‘+‘ I used before and the ‘*‘ is that ‘+‘ means “at least one row” while ‘*‘ means “zero or more rows”.
- When I define B, I use PREV to refer to the previous row.
The last row, with NumVal = 20, is a match where I found one A and no B. If I had said ‘B+’ that row would be unmatched.
An example with more syntax
An AskTom reader wanted to “partition the range of sequential numbers”. Unlike our first example, he asked to output every line, showing the range on the last line of each match. Here is the test data and expected output.
create table RAND(N) as select 1 from dual union all select 2 from dual union all select 3 from dual union all select 5 from dual union all select 7 from dual union all select 8 from dual union all select 9 from dual union all select 11 from dual;
To get this result, we have to change
MEASURES (for the output columns) and
ROWS PER MATCH (for the output rows).
SELECT * FROM rand MATCH_RECOGNIZE ( ORDER BY n MEASURES CASE WHEN n = FINAL LAST(n) THEN a.n END firstn, CASE WHEN n = FINAL LAST(n) and n != a.n THEN n END lastn ALL ROWS PER MATCH PATTERN (a b*) DEFINE b AS n = prev(n)+1 );
ALL ROWS PER MATCHwill output every row that is involved in a match. Unmatched rows are not returned. (Don’t worry, there is an option to get unmatched rows if you ever need it.)
- Strangely enough, When you ask for ALL ROWS, you get more columns too! Notice I get column N in the output without asking for it in the
- In the
MEASURESclause, “N” means the value of N for the current row. “LAST(N)” means the same thing, since the “last” row is actually the last row so far. If you want to look ahead to the very last row of the match, you need to say
- Notice I say “a.n” instead of FIRST(n). I just wanted to show that the pattern identifiers can be used to qualify rows. “a.n” means the most recent row mapped to A, which is also the first row of the match.
Is MATCH_RECOGNIZE “better” than Tabibitosan?
Tabibitosan is, in my opinion, the most efficient pre-12c pattern matching method, since it finds the matches with one
SELECT using analytic functions. Unfortunately, it solves a limited range of problems and it isn’t that easy to understand and apply properly.
MATCH_RECOGNIZE seems easier to use and, as the next post will show, it also solves problems when Tabibitosan is not quite enough.