Rob van Wijk and ad hoc Grouping

In March 2014, I wrote about two methods for ad hoc grouping: “Grouping Sequences” and “Start of Group”. I just found out that Rob van Wijk wrote on the same subjects two months earlier. We even used the same example!

I mention Rob’s article for two reasons:

  • If my explanation wasn’t clear enough, you can try Rob’s.
  • When I use “Start of Group”, Rob uses a similar method he calls “max-on-case-row-number”. These techniques are completely equivalent. Execution plans and performance are identical.

The “Start of Group” method can take many forms. Don’t think you need to know every one! Just choose the form you understand and like best, and use it all the time.
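
For example, here is a minimal sketch (not the example Rob and I used) that groups consecutive rows having the same VAL and computes both forms side by side: GRP is the running SUM() of 0/1 “start of group” flags, GRP_START_RN is the running MAX() of a CASE on the row number. Both columns identify the same groups; only the labels differ.

with t as (
  select level rn, decode(level, 1,1, 2,1, 3,2, 4,2, 5,2, 6,1) val
  from dual connect by level <= 6
)
select rn, val,
       sum(flag) over(order by rn) grp,              -- "Start of Group"
       max(start_rn) over(order by rn) grp_start_rn  -- "max-on-case-row-number"
from (
  select rn, val,
         case when val = lag(val) over(order by rn) then 0 else 1 end flag,
         case when val = lag(val) over(order by rn) then null else rn end start_rn
  from t
);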

Don’t forget, when you get 12c you can replace both methods by the MATCH_RECOGNIZE clause :)

SQL and date ranges: don’t make NULL mean something

We often use NULL values in date ranges, usually in the end date to signify the indefinite future. I have decided that this is a bad idea; here are my reasons.

[Update 2014-07-04] There have been some great replies to this post. I am going to address some of their points.

1) NULL should not mean anything

According to the SQL standard, NULL means “unknown”. That is why you can’t really compare a NULL value to anything, you can only say if it IS NULL or not.
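
To illustrate (a trivial sketch, not from the original discussion): comparing NULL to anything, even to another NULL, is neither true nor false, so only the IS NULL test finds a row.

select 'equal'     from dual where null = null;   -- no rows selected
select 'not equal' from dual where null != null;  -- no rows selected
select 'is null'   from dual where null is null;  -- one row selected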

If you use NULL in date ranges, it doesn’t just mean something, it means two different things! In the start date it means “from the beginning of time”, and in the end date it means “until the end of time”. This is not only contrary to the “official” meaning of NULL, it is confusing.

[Update] Kevan Gelling points out that there may really be “unknown” values, in which case NULL should be allowed. I agree, as long as NULL is reserved for that use.

Jeffrey Kemp says: ‘In most cases an End Date of NULL means “we don’t know (yet) what the end date will be – or even if it will ever end”.’ Jeffrey, you should add “we don’t even know if it has already ended”! Your phrase taken alone implies that the end date is in the future, not in the past. This makes NULL meaningful.

2) We can use real date limits instead of NULL

One argument for using NULL is that it means the earliest possible start date, or the latest possible end date. We don’t need NULL for that! In Oracle, the earliest possible date is 4712-01-01 B.C., or DATE '-4712-01-01'. The latest possible date is DATE '9999-12-31' (you can add the time element 23:59:59 if you want to be a purist.)

To enforce this, I suggest declaring the start and end date columns as NOT NULL with default values. In Oracle Database 12c, I would use the DEFAULT ON NULL clause: it puts the default value in the column even if you explicitly try to insert NULL.

If you want the output to show NULL instead of these default values, you do have to use something like NULLIF() in your SELECT clause.
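
To make this concrete, here is a minimal sketch using a hypothetical PRICE_HIST table (the table name and columns are my own, not from any post mentioned here): DEFAULT ON NULL supplies the extreme dates even when the application explicitly inserts NULL, and NULLIF() hides them on output.

create table price_hist (
  prod_id    number not null,
  price      number not null,
  start_date date default on null date '-4712-01-01' not null,
  end_date   date default on null date '9999-12-31' not null
);

-- the explicit NULL is silently replaced by DATE '9999-12-31'
insert into price_hist (prod_id, price, start_date, end_date)
values (1, 100, date '2014-01-01', null);

-- show NULL instead of the default values on output
select prod_id, price,
       nullif(start_date, date '-4712-01-01') start_date,
       nullif(end_date,   date '9999-12-31') end_date
from price_hist;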

[Update] Again, Kevan Gelling argues that NULL may be necessary to indicate a value that is not known. In that case my suggestion is no good, but you still have to use something other than NULL to indicate “beginning of time” and “end of time”.

Several posters bring up the fact that artificial extreme dates will “skew” the data and make it hard for the optimizer to choose the best plan. They are right, I should mention this. However, the data is skewed whether I use extreme dates or NULL. Supposing I do use NULL to mean “the end of time”, many queries will have to use END_DATE IS NULL in the WHERE clause, so the skew is there no matter what.

3) Oracle doesn’t index NULLs

When you query data with date ranges, you often have to check “greater than” one value and “less than” another value. This may require separate indexes on start date and end date. If you have NULLs in your date ranges, those indexes will not be used since Oracle doesn’t create index entries when all the indexed columns are NULL.

If you use real values and NOT NULL, your indexes will always work.
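
As a sketch of the argument (reusing the hypothetical PRICE_HIST table, but supposing for a moment that END_DATE were nullable):

create index price_hist_end_i on price_hist(end_date);

-- with a real "end of time" value, the rows currently in effect can be found
-- with a simple range predicate that can use the index:
select * from price_hist where end_date > sysdate;

-- if NULL meant "end of time", the filter would need IS NULL, and a row whose
-- only indexed column is NULL has no entry in that index:
select * from price_hist where end_date is null or end_date > sysdate;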

[Update] Some readers apparently overlooked the qualifier “when all the indexed columns are NULL” in the previous paragraph. The header was probably misleading: Oracle will indeed index a row if any of the indexed columns is not NULL.

Some objected that indexes on dates alone are rarely used: date ranges are almost always applied to some object, so the index will include the object and one or both dates. In that case, the rows with NULL dates will be indexed. I agree. In that situation, NULLs don’t cause a problem, and “skew” won’t either, as long as the object comes before the date in the index column list.

4) Queries on NULLable date ranges are hard

I have blogged about various date range problems: finding gaps, finding overlaps and merging contiguous ranges. Almost always, my solutions worked fine without NULLs and broke when I introduced NULLs. Handling NULLs required either more complex queries or substituting real values for NULL. Why go to all that work when we can just use the real values to begin with?
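
For example (a sketch on the hypothetical PRICE_HIST table, not one of the queries from those posts), here is a simple overlap test written both ways:

-- with real limit values, the classic overlap test is enough:
select a.prod_id
from price_hist a join price_hist b
  on a.prod_id = b.prod_id and a.rowid < b.rowid
where a.start_date < b.end_date
  and b.start_date < a.end_date;

-- with NULLable dates, every bound needs an NVL() (or extra OR conditions):
select a.prod_id
from price_hist a join price_hist b
  on a.prod_id = b.prod_id and a.rowid < b.rowid
where nvl(a.start_date, date '-4712-01-01') < nvl(b.end_date, date '9999-12-31')
  and nvl(b.start_date, date '-4712-01-01') < nvl(a.end_date, date '9999-12-31');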

[Update] Even if we need NULL values sometimes to mean “unknown value”, we still don’t use IS NULL in our queries, so indexes should work when needed.

5) Unique constraints don’t work with NULL

Most queries on date ranges, including mine, assume that start dates are unique. In some cases, it may also be useful to have unique end dates. The only way to make sure they are unique is to add a unique constraint. Unfortunately, uniqueness is only enforced for NOT NULL values.

Most of the time, your constraint will be on two columns: some object and a date. In this case the unique constraint will work as long as the object column is NOT NULL. However, anytime you need unique dates throughout a table you must define them as NOT NULL and use default values.

[Update] As mentioned under paragraph 3), this point is only valid for unique constraints on the date alone. This is probably a rare case. Also, as Kevan reminded me, you could always use a function-based index on (DATE_COL, ‘X’) to make sure every row was indexed.
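
If you do need that, Kevan’s suggestion might look something like this sketch (again on the hypothetical PRICE_HIST table):

-- the constant 'X' is never NULL, so every row gets an index entry,
-- even rows where END_DATE is NULL
create index price_hist_end_fbi on price_hist(end_date, 'X');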

[Update] Conclusion: NULL should mean “unknown” only

As readers have pointed out, my arguments 3) and 5) are weak, since they only apply to indexes or constraints on a date column alone, and those are rarely needed.

I’ll stand by arguments 1), 2) and 4), but thanks to Kevan I’ll add one thing: just because I don’t use NULL to mean “beginning or end of time”, that doesn’t mean I might not need it to mean “unknown value”. In that case, default values may not be the way to go.

Finally, please read the replies: good stuff, and thanks to all!

SQL to find World Cup matches with comebacks

Lucas Jellema is writing a great series on SQL applied to the World Cup of football / soccer. In the article about finding matches with comebacks, he challenged readers to “find ‘dramatic comebacks’ where a team was two goals behind at some stage and yet managed to win the game.” Here is my reply, since replying directly on his blog was too challenging for me!

Here is a copy of Lucas’ DDL and test data. I added three rows to his data, one with a draw, one with a scoreless draw and one with a “dramatic comeback”.

drop table match_results;
CREATE TABLE "MATCH_RESULTS" 
   (    "GROUP1" VARCHAR2(1 BYTE), 
    "HOME" NUMBER(1,0), 
    "AWAY" NUMBER(1,0), 
    "HOME_GOALS" NUMBER(2,0), 
    "AWAY_GOALS" NUMBER(2,0), 
    "SDM_ID" NUMBER(2,0), 
    "LOCAL_START_TIME" DATE, 
    "SCORING_PROCESS" VARCHAR2(20 BYTE), 
    "WEATHER_CATEGORY" VARCHAR2(20 BYTE), 
    "ID" NUMBER(2,0) 
   ); 
Insert into MATCH_RESULTS values 
  ('A',1,2,3,1,5,to_date('12-JUN-14','DD-MON-RR'),'1000',null,1); 
Insert into MATCH_RESULTS values
  ('A',3,4,1,0,3,to_date('13-JUN-14','DD-MON-RR'),'0',null,2); 
Insert into MATCH_RESULTS values 
  ('A',1,3,0,0,6,to_date('17-JUN-14','DD-MON-RR'),'0',null,17); 
Insert into MATCH_RESULTS values 
  ('A',4,2,0,4,2,to_date('18-JUN-14','DD-MON-RR'),'1111',null,18); 
Insert into MATCH_RESULTS values 
  ('B',1,2,1,5,12,to_date('13-JUN-14','DD-MON-RR'),'011111',null,3); 
Insert into MATCH_RESULTS values 
  ('B',3,4,3,1,7,to_date('13-JUN-14','DD-MON-RR'),'0010',null,4); 
Insert into MATCH_RESULTS values 
  ('B',1,3,0,2,1,to_date('18-JUN-14','DD-MON-RR'),'11',null,19); 
Insert into MATCH_RESULTS values 
  ('B',4,2,2,3,10,to_date('18-JUN-14','DD-MON-RR'),'10011',null,20); 
Insert into MATCH_RESULTS values 
  ('B',4,2,2,3,10,to_date('18-JUN-14','DD-MON-RR'),'1001',null,21); 
Insert into MATCH_RESULTS values 
  ('B',4,2,2,3,10,to_date('18-JUN-14','DD-MON-RR'),null,null,22); 
Insert into MATCH_RESULTS values 
  ('B',4,2,2,3,10,to_date('18-JUN-14','DD-MON-RR'),'00011111',null,23); 
commit;

The SCORING_PROCESS column shows who scored in what order: for example ‘100’ means one team scored the first goal, then the other team scored 2 goals.

Who won?

Conceptually, the first thing I do is determine who won the game, by counting the number of ‘1’ and the number of ‘0’. When the ‘1’s are ahead, I say the winner is 1; when there are more ‘0’s, I say the winner is -1. For draws, the “winner” is 0; I then remove draws in my WHERE clause.

select id, scoring_process, 
sign(length(replace(scoring_process||'01','0','')) 
   - length(replace(scoring_process||'01','1',''))) 
winner 
from match_results
where sign(length(replace(scoring_process||'01','0','')) 
         - length(replace(scoring_process||'01','1',''))) 
  != 0;
ID SCORING_PROCESS WINNER
 1 1000                -1
 2 0                   -1
17 0                   -1
18 1111                 1
 3 011111               1
 4 0010                -1
19 11                   1
20 10011                1
23 00011111             1

How far behind did the winner get?

For each row, I go through the SCORING_PROCESS column, turning ‘1’ into the number 1 and ‘0’ into the number -1. This makes it easy to use the SUM() analytic function to calculate who led by how much after each goal. Here is an example using a hard-coded value.

select sum(
  decode(to_number(substr('00011111',level,1)),0,-1,1) 
) over (order by level)
cum_score
from dual 
connect by level <= length('00011111');
CUM_SCORE
-1
-2
-3
-2
-1
0
1
2

In this case, I know the winner was team 1, so I can see that they came back from a 3-goal deficit. Putting it all together, here is a solution that shows all the comebacks and how far behind the winning team got.

select id, scoring_process, 
max(-winner*column_value) max_behind 
from ( 
  select id, scoring_process, 
  sign(length(replace(scoring_process||'01','0','')) 
     - length(replace(scoring_process||'01','1',''))) 
  winner 
  from match_results
  where sign(length(replace(scoring_process||'01','0','')) 
           - length(replace(scoring_process||'01','1',''))) 
    != 0
) mr, 
table(cast(multiset(
  select sum( 
    decode(to_number(substr(mr.scoring_process,level,1)),0,-1,1) 
  ) over (order by level) 
  from dual 
  connect by level <= length(mr.scoring_process)
) as sys.odcinumberlist)) l
group by id, scoring_process
having max(-winner*column_value) > 0;
ID SCORING_PROCESS MAX_BEHIND
23 00011111                 3
 3 011111                   1
 1 1000                     1
20 10011                    1

Bin Fitting problems with SQL

“Bin Fitting” or “Bin Packing” means putting the greatest quantity in the smallest number of “bins” or containers. There are some good solutions with the MODEL clause, but the most concise and efficient solutions use the new 12c MATCH_RECOGNIZE clause.

There are two categories of “bin fitting” problems:

  1. The number of bins is fixed and the capacity can vary.
    > We want to equalize the quantities we put into each bin.
  2. The bins have a fixed capacity and the number of bins can vary.
    > We want to minimize the number of bins.

I’ll show an example of each.

Fill a fixed number of bins as equally as possible

Wikipedia calls this the “Multiprocessor scheduling problem”. Brendan Furey, author of a brilliant MODEL solution, calls it “balanced number partitioning”. Let’s take a simple, reproducible example:

CREATE TABLE items AS
SELECT LEVEL item_name, level item_value 
from dual
connect by level <= 10;

The total value of all the items is 55. We want to split the items into 3 “bins” or “buckets” or groups, so that the sums of the values are as equal as possible. Here the best we can do is put 19 in one bin and 18 in each of the other bins.

There are several algorithms that solve this problem more or less accurately. The one I’m going to show is fairly simple:

  • First, order the items by value in descending order
  • Then, assign each item to whatever bin has the smallest sum so far.

Here is Brendan’s MODEL implementation:

SELECT item_name, bin, item_value,
Max (bin_value) OVER (PARTITION BY bin) bin_value
FROM (
  SELECT * FROM items
  MODEL 
  DIMENSION BY (Row_Number() OVER (ORDER BY item_value DESC) rn)
  MEASURES (
    item_name, 
    item_value,
    Row_Number() OVER (ORDER BY item_value DESC) bin,
    item_value bin_value,
    Row_Number() OVER (ORDER BY item_value DESC) rn_m,
    0 min_bin,
    Count(*) OVER () - :N_BINS - 1 n_iters
  )
  RULES ITERATE(100000) UNTIL (ITERATION_NUMBER >= n_iters[1]) (
    min_bin[1] =
      Min(rn_m) KEEP (DENSE_RANK FIRST ORDER BY bin_value)[rn <= :N_BINS],
    bin[ITERATION_NUMBER + :N_BINS + 1] = 
      min_bin[1],
    bin_value[min_bin[1]] =
      bin_value[CV()] + Nvl (item_value[ITERATION_NUMBER + :N_BINS + 1], 0)
  )
)
WHERE item_name IS NOT NULL
ORDER BY item_value DESC;
ITEM_NAME BIN ITEM_VALUE BIN_VALUE
       10   1         10        19
        9   2          9        18
        8   3          8        18
        7   3          7        18
        6   2          6        18
        5   1          5        19
        4   1          4        19
        3   2          3        18
        2   3          2        18
        1   3          1        18

Please refer to Brendan’s post for a full explanation, but notice the number of “bins” is actually a bind variable!

Unfortunately, this is not a very scalable solution: with 10,000 records, this query runs in about 30 seconds. My implementation with the 12c MATCH_RECOGNIZE clause has to hard-code the number of buckets, but it does 10,000 records in 0.2 seconds!

SELECT * from  items
MATCH_RECOGNIZE (
  ORDER BY item_value desc
  MEASURES to_number(substr(classifier(),4)) bin#,
    sum(bin1.item_value) bin1,
    sum(bin2.item_value) bin2,
    sum(bin3.item_value) bin3
  ALL ROWS PER MATCH
  PATTERN ((bin1|bin2|bin3)*)
  DEFINE bin1 AS count(bin1.*) = 1
    OR sum(bin1.item_value)-bin1.item_value
        <= least(sum(bin2.item_value), sum(bin3.item_value)),
    bin2 AS count(bin2.*) = 1
      OR sum(bin2.item_value)-bin2.item_value
        <= sum(bin3.item_value)
);
ITEM_NAME ITEM_VALUE BIN# BIN1 BIN2 BIN3
       10         10    1   10
        9          9    2   10    9
        8          8    3   10    9    8
        7          7    3   10    9   15
        6          6    2   10   15   15
        5          5    1   15   15   15
        4          4    1   19   15   15
        3          3    2   19   18   15
        2          2    3   19   18   17
        1          1    3   19   18   18

I’ve already mentioned the basic syntax of this clause in previous posts, but look at the PATTERN line. The vertical pipe means “OR”: I want a row that matches either “bin1”, “bin2” or “bin3”, and the first match wins.

The DEFINE clause takes some getting used to. The way it works is, you assume that the current row matches the first condition and you test the condition; if the condition is true then there really is a match and you move on to the next row. Otherwise you test the second condition, and so on until the last. In this case, I don’t have to define BIN3 at all: it’s always true if neither of the previous conditions is.

The BIN1 condition starts with count(bin1.*) = 1: this can only be true if no previous row has been assigned to BIN1. The other part of the condition compares the sums of BIN1, BIN2 and BIN3. If BIN1 has the smallest amount, it gets the current row. Now here’s the tricky part: if I just say sum(bin1.item_value), then that sum will include the row I am testing. What I really want is the sum so far, not including the row I am testing, so I have to subtract that value.

The BIN2 condition is simpler: if I get that far, I already know that BIN1 is not the smallest bin, so I only need to compare BIN2 with BIN3.

Bins have a fixed capacity: use as few bins as possible

My first post on the MATCH_RECOGNIZE clause showed how to calculate running totals that start over before the total exceeds a certain value. This is a simple version of the “fixed capacity” problem.
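
For reference, here is a minimal sketch of that simpler variant (my own recap, assuming the table T created just below and that no single row exceeds the capacity): a new match, and therefore a new bin, starts as soon as adding the current row would push the running total over 50.

select cat_id, item_id, num_units, bin#, running_total
from t
match_recognize(
  partition by cat_id
  order by item_id
  measures match_number() bin#,
           running sum(num_units) running_total
  all rows per match
  pattern(A+)
  define A as sum(num_units) <= 50  -- running total including the current row
);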

I’m going to complicate things a bit by letting each bin have an elastic capacity. The normal capacity is 50, but in some cases I can go over the limit of 50, as long as I never go over 55.

  • If the current bin already has 50 or more, it is full and cannot receive a new item.
  • If the current bin has less than 50, I can add a new item as long as the new total is 55 or less.

I admit that I was hoping to solve this problem by using more advanced syntax in the PATTERN clause, such as “reluctant quantifiers”. I tried 4 or 5 different solutions and they all had bugs. The only solution that seems to work uses a simple PATTERN and a slightly more complex DEFINE. Anyway, here’s the code to create a test table with different boundary conditions.

drop table t cascade constraints purge;
create table t(cat_id number, item_id number, num_units number, incr number);

insert into t
select rownum cat_id, rownum item_id, num_units, incr from (
  select level incr from dual
  connect by level <= 6
) a
, lateral(
  select 51 - level num_units from dual
  connect by level <= a.incr+1
);
insert into t
select cat_id, rownum+100 item_id, incr num_units, incr from t
, lateral(
  select null n from dual
  connect by level <= (56 - t.num_units)/t.incr+1
);
insert into t values(100, 200, 41, 1);

select cat_id, item_id, num_units,
incr "Increment"
from t
where num_units > 40
order by cat_id, item_id;
CAT_ID ITEM_ID NUM_UNITS Increment
1 1 50 1
2 2 49 1
3 3 50 2
4 4 49 2
5 5 48 2
6 6 50 3
7 7 49 3
8 8 48 3
9 9 47 3
10 10 50 4
11 11 49 4
12 12 48 4
13 13 47 4
14 14 46 4
15 15 50 5
16 16 49 5
17 17 48 5
18 18 47 5
19 19 46 5
20 20 45 5
21 21 50 6
22 22 49 6
23 23 48 6
24 24 47 6
25 25 46 6
26 26 45 6
27 27 44 6
100 200 41 1

Each category is a different test case. The first item in each category has a value between 44 and 50. The following items all have the same values, indicated in the “Increment” column. I add a last case at the end where I don’t reach the limit of 50.

Notice the use of the LATERAL clause, which is now documented in 12c. It’s a lot simpler than the old TABLE(CAST(MULTISET...)) syntax. Now here is the solution; I only show the first bin in each category, plus the first item in the second bin. I’ve got nothing more to say, so don’t scroll to the bottom if you don’t feel like it :)

select cat_id, bin#, sum_units, num_units, item_id
from t
match_recognize(
  partition by cat_id
  order by item_id
  measures final sum(num_units) sum_units,
  match_number() bin#, count(*) cnt
  all rows per match
  pattern(A+)
  define A as sum(num_units) - num_units < 50
    and sum(num_units) <= 55
)
where bin# = 1 or cnt = 1;
CAT_ID BIN# SUM_UNITS NUM_UNITS ITEM_ID
1 1 50 50 1
1 2 7 1 101
2 1 50 49 2
2 1 50 1 108
2 2 7 1 109
3 1 50 50 3
3 2 8 2 116
4 1 51 49 4
4 1 51 2 120
4 2 6 2 121
5 1 50 48 5
5 1 50 2 124
5 2 8 2 125
6 1 50 50 6
6 2 9 3 129
7 1 52 49 7
7 1 52 3 132
7 2 6 3 133
8 1 51 48 8
8 1 51 3 135
8 2 6 3 136
9 1 50 47 9
9 1 50 3 138
9 2 9 3 139
10 1 50 50 10
10 2 8 4 142
11 1 53 49 11
11 1 53 4 144
11 2 4 4 145
12 1 52 48 12
12 1 52 4 146
12 2 8 4 147
13 1 51 47 13
13 1 51 4 149
13 2 8 4 150
14 1 50 46 14
14 1 50 4 152
14 2 8 4 153
15 1 50 50 15
15 2 10 5 155
16 1 54 49 16
16 1 54 5 157
16 2 5 5 158
17 1 53 48 17
17 1 53 5 159
17 2 5 5 160
18 1 52 47 18
18 1 52 5 161
18 2 5 5 162
19 1 51 46 19
19 1 51 5 163
19 2 10 5 164
20 1 50 45 20
20 1 50 5 166
20 2 10 5 167
21 1 50 50 21
21 2 12 6 169
22 1 55 49 22
22 1 55 6 171
22 2 6 6 172
23 1 54 48 23
23 1 54 6 173
23 2 6 6 174
24 1 53 47 24
24 1 53 6 175
24 2 6 6 176
25 1 52 46 25
25 1 52 6 177
25 2 6 6 178
26 1 51 45 26
26 1 51 6 179
26 2 6 6 180
27 1 50 44 27
27 1 50 6 181
27 2 12 6 182
100 1 41 41 200

Beyond Analytics: MODEL or MATCH_RECOGNIZE

Analytic functions are powerful and efficient, but sometimes they just aren’t enough. When you try analytics and they alone don’t solve the problem, it’s time to think about the MODEL clause – or upgrade to 12c and use MATCH_RECOGNIZE. All three can use partitions and ordering for simpler, more efficient processing. To illustrate, here’s a question from AskTom that Brendan Furey and I answered three different ways: “Group records that are within 6 hours of the first record in the group”.

The requirement

The input contains several records per ID, each record having a datetime called INS_DATE.

drop table GRTAB purge;
create table GRTAB (id number, INS_DATE date);
--
insert into GRTAB values(1,TO_DATE('2011-05-25 23:13:32','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 02:14:19','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 04:15:30','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 05:14:31','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 07:15:19','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 10:15:50','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 13:44:46','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 15:14:54','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 16:15:01','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 17:14:38','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(1,TO_DATE('2011-05-26 19:15:36','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(2,TO_DATE('2011-05-30 11:30:17','YYYY-MM-DD HH24:MI:SS'));
insert into GRTAB values(2,TO_DATE('2011-05-30 14:30:22','YYYY-MM-DD HH24:MI:SS'));

For each ID, the first group starts with the earliest record, whose INS_DATE becomes the BASE_DATE for that group. This group contains all the records that are within 6 hours of BASE_DATE. The next group starts with the earliest record after the previous group, whose INS_DATE becomes the BASE_DATE of that group.

ID INS_DATE            BASE_DATE
 1 2011-05-25 23:13:32 2011-05-25 23:13:32
 1 2011-05-26 02:14:19 2011-05-25 23:13:32
 1 2011-05-26 04:15:30 2011-05-25 23:13:32
 1 2011-05-26 05:14:31 2011-05-26 05:14:31
 1 2011-05-26 07:15:19 2011-05-26 05:14:31
 1 2011-05-26 10:15:50 2011-05-26 05:14:31
 1 2011-05-26 13:44:46 2011-05-26 13:44:46
 1 2011-05-26 15:14:54 2011-05-26 13:44:46
 1 2011-05-26 16:15:01 2011-05-26 13:44:46
 1 2011-05-26 17:14:38 2011-05-26 13:44:46
 1 2011-05-26 19:15:36 2011-05-26 13:44:46
 2 2011-05-30 11:30:17 2011-05-30 11:30:17
 2 2011-05-30 14:30:22 2011-05-30 11:30:17

Why use analytics? PARTITION and ORDER BY!

If we know and use analytic functions, we naturally think of using them for this kind of problem. Why? Because the first thing we want to do with the input is PARTITION it by ID, and the next thing we want to do is ORDER each partition by INS_DATE. Partitioning makes processing simpler and more efficient: we only have to think about one ID at a time, and Oracle can independently process the records for each ID. Once we have partitioned, we obviously have to order by INS_DATE.

Why not use analytics?

That is the question. Partitioning and ordering may help solve a problem, but that doesn’t mean analytics are the complete solution. How can we tell when analytics will work and when they won’t?

Analytic functions have input and output: the input is the result of the JOINs, GROUP BYs, etc., and it already contains all the rows of the final result set. An analytic function can use input data from rows before and after the current row, but it cannot access the output from analytics in other rows.

One way around this limitation is to nest analytic functions with subqueries or inline views. The “start of group” method does exactly this: it uses analytics to assign 1 to the first row of a group (and 0 to the others), then it uses the 1s and 0s as input to a SUM() function to generate the group identifiers. Here analytics works because we assign 1 or 0 based only on the input to the function: we don’t need to know whether any previous row got a 1 or a 0.

We cannot solve our current problem with nested analytics because we don’t know how much nesting to do. We would have to nest again every time the BASE_DATE changes!

No, this is a job for what Tom Kyte calls “procedural” processing, even if it can be done in SQL.

Who else does partitioning? The MODEL clause

The MODEL clause, introduced in Oracle 10g, is complex and therefore hard to understand. Still, we can use bits of its functionality to do processing that works like analytics but lets us be “procedural”.

SELECT id, ins_date, base_date
FROM grtab
MODEL
  PARTITION BY (id)
  DIMENSION BY (Row_Number() OVER (PARTITION BY id ORDER BY ins_date) rn)
  MEASURES (ins_date, ins_date base_date)
  RULES (
    base_date[rn > 1] =
      CASE
        WHEN ins_date[cv()] - base_date[cv()-1] > 0.25 THEN ins_date[cv()] 
        ELSE base_date[cv()-1]
      END
  )
  • Without a detailed explanation, we can see that the MODEL clause allows partitioning.
  • Then, instead of an ORDER BY, we use a DIMENSION that orders the data and assigns a sequential subscript to each row.
  • The MEASURES clause adds a BASE_DATE column that is initially populated with the INS_DATE value.
  • After that, our RULE says to go through every row except the first, and to update the BASE_DATE column. The rule says in effect: if my ins_date is more than 6 hours later than the previous row’s base_date, then assign my ins_date to my base_date; otherwise my base_date is the same as the previous row’s base_date.
    By “my” I mean the current row. The MODEL clause refers to the current row using the subscript cv(). Since the subscripts are sequential integers, we can refer to the previous row using cv()-1.

If you look closely, you’ll see that the rule doesn’t make sense unless it is executed in order, from the second row to the last. This is not a problem, since “the order defaults to the order of the columns as specified in the DIMENSION BY clause.”

The key advantage to the MODEL clause over analytics is that it orders not only the data, but the processing. Calculations are done in order, and they can use the results of prior calculations.

12c MATCH_RECOGNIZE partitions and orders by

Once we have Oracle 12c, we can use the MATCH_RECOGNIZE clause to solve problems that require nested analytics, or that analytics can’t solve at all.

SELECT * FROM grtab
MATCH_RECOGNIZE (
  PARTITION BY ID
  ORDER BY ins_date
  MEASURES FIRST(ins_date) base_date
  ALL ROWS PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (A+)
  DEFINE A AS ins_date < FIRST(ins_date) + 0.25
)

Here we have our good old PARTITION and ORDER clauses, but unlike analytics they apply to the entire row and not just to one column. The PATTERN and DEFINE clauses unite in one “match” the first row and all following rows that are within 6 hours of the first row.

Then the “magic” happens: when the first “match” is done, the next match starts with the row right after the current match. This is what “after match skip past last row” means. I could have omitted the line because it is the default behavior, but it’s important enough to say explicitly. This instruction lets us “start over” after every match, based on the result of the previous match.

MATCH_RECOGNIZE is a kind of compromise between analytics and procedural processing: we can order the data like analytics, but we can also order the processing and (to some extent) base our logic on prior processing. However, the order of processing is the same as the order of data: we can’t go backwards, skip around or do loops. This type of processing is neither purely “set-based” nor fully “procedural”; I call it “stream-based” processing.

Performance

The execution plans for MODEL and MATCH_RECOGNIZE are pretty cryptic:

-------------------------------------
| Id  | Operation           | Name  |
-------------------------------------
|   0 | SELECT STATEMENT    |       |
|   1 |  SQL MODEL ORDERED  |       |
|   2 |   WINDOW SORT       |       |
|   3 |    TABLE ACCESS FULL| GRTAB |
-------------------------------------
-----------------------------------------------------------------
| Id  | Operation                                       | Name  |
-----------------------------------------------------------------
|   0 | SELECT STATEMENT                                |       |
|   1 |  VIEW                                           |       |
|   2 |   MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO|       |
|   3 |    TABLE ACCESS FULL                            | GRTAB |
-----------------------------------------------------------------

We know that the table was only accessed once, so the I/O was efficient, but we don’t have many details on memory and CPU usage. Using Tom Kyte’s RUNSTATS package on a table with 212992 rows, I found that MATCH_RECOGNIZE ran 20% faster, used 20% less CPU and 40% less memory than MODEL.

Conclusion

Analytic functions let us work with data in other rows than the current row, but we cannot access the results of analytic functions in other rows. We can use the MODEL clause when analytics are not enough: the key is to use ROW_NUMBER() in the DIMENSION clause to assign sequential subscripts to the data and to order the processing. The MATCH_RECOGNIZE clause orders both data and processing and lets us use the results of prior processing, especially to start a new “match” based on where the previous match ended.

Down with Firefox

Warning: this post is not technical and it is not about Oracle.

Brendan Eich recently resigned under pressure from his job as CEO of Mozilla, the makers of Firefox. The reason given was a campaign contribution that Mr. Eich made in 2008.

The State of California has the “referendum”: a proposition is submitted to a yes or no vote, and if accepted it becomes law. This is pure democracy.

In 2008, Mr. Eich made a contribution to a group that opposed a proposition. In 2014, a group favorable to the proposition started a boycott of Firefox when Mr. Eich was promoted to CEO. Mozilla succumbed to pressure and asked Mr. Eich to resign, which he finally agreed to do.

It doesn’t matter what the proposition was about. This was an exercise in democracy and Mr. Eich was participating in the debate as a private individual (although he named his employer as required by law). If he had contributed to the opposite side, would he have been asked to leave? (Laughter)

Mr. Eich lost his job because he exercised his individual liberty on the “wrong side”. In protest, I am personally boycotting Firefox. This post and all further posts will be written using Google Chrome.

If you think that freedom of democratic debate should be limited to those who agree with you, then feel free to boycott this blog.

Overlapping ranges with priority

A few years ago, Alberto Dell’Era blogged about product prices with overlapping date ranges; “on an overlapping range, the strongest priority (lower number) wins.” His analysis, explanation and solution are all excellent. I’m revisiting the subject to suggest a few improvements (and mention MATCH_RECOGNIZE yet again).

To explain more easily, Alberto uses letters (a,b,c,d,e) instead of dates, and in the code he changes the letters to numbers using the ascii() function. (The logic is the same, whether the ranges are dates or something else.) He uses the term sku as a product identifier. “Prices of skus (i.e. goods) have validity ranges and can overlap; on an overlapping range, the strongest priority (lower number) wins.” Sample input:

b______(0,$200)_______d
b__________c______(1,$300)_______e

Notice that d and e are not included in their ranges: the end points are exclusive. Since priority 0 is stronger than 1, the expected output is:

b_______($200)________d
b__________c__________d__($300)__e
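
If you want to follow along, here is a hypothetical reconstruction of the test table based on the diagrams above (not Alberto’s actual DDL), with the letters stored as numbers via ascii():

create table ranges as
select 1 sku, ascii('b') a, ascii('d') b, 0 prio, 200 price from dual
union all
select 1 sku, ascii('b') a, ascii('e') b, 1 prio, 300 price from dual;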

The Algorithm

1) Get distinct start and end points
    b,c,d,e

2) Create non-overlapping base ranges (Alberto has a not null end point, but in general nulls may be useful.)
    b__________c
    b__________c__________d
    b__________c__________d__________e
    b__________c__________d__________e_(null)

3) Assign price/priority to base ranges
    b_(0,$200)_c
    b__________c_(0,$200)_d
    b__________c_(1,$300)_d
    b__________c__________d_(1,$300)_e

4) For each base range, keep highest priority
    b__($200)__c
    b__________c__($200)__d
    b__________c__________d__($300)__e

5) Merge consecutive ranges with same price
    b_______($200)________d
    b__________c__________d__($300)__e

1) Get distinct start and end points

Alberto uses UNION to get the start points, then the end points.

with instants as (
  select sku, a as i from ranges
  union
  select sku, b as i from ranges
),

This requires two scans of the index. I suggest using UNPIVOT, which does the work with one scan. I’ll show you that next.

2) Create non-overlapping base ranges

Alberto does this with the analytic LEAD function:

base_ranges as (
  select * from (
    select sku,
    i as ba,
    lead(i) over (partition by sku order by i) as bb
    from instants
  )
  where bb is not null
),

I do the same thing, but thanks to UNPIVOT I can do steps 1) and 2) with one SELECT:

with base_ranges as (
  select distinct sku, i ba,
  lead(i) over(partition by sku order by i, col) bb,
  dense_rank() over(partition by sku order by i) seq
  from ranges
  unpivot(i for col in(a,b))
),

Notice I add the column SEQ, which assigns consecutive integers to each date range. SEQ will allow me to use the Tabibitosan method in step 5)!
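
In case the Tabibitosan method is new to you, here is a tiny standalone illustration (not part of the solution): when values are consecutive, subtracting ROW_NUMBER() gives a constant, so each run of consecutive values gets its own group identifier.

select column_value seq,
       column_value - row_number() over(order by column_value) grp
from table(sys.odcinumberlist(1,2,3,5,6));

-- SEQ GRP
--   1   0
--   2   0
--   3   0
--   5   1
--   6   1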

3) Assign price/priority to base ranges

We both do this by joining the base ranges to the original rows. Here’s Alberto’s code:

factored_ranges as (
  select i.sku, bi.ba, bi.bb, i.a, i.b, i.prio, i.price
  from ranges i, base_ranges bi
  where i.sku = bi.sku
  and (i.a <= bi.ba and bi.ba < i.b)  
),

4) For each base range, keep highest priority

Alberto uses the DENSE_RANK analytic function on each base range, giving a rank of 1 to the highest priority, then keeps only the rows where the rank is 1.

strongest_factored_ranges as (
  select sku, ba, bb, prio, price
  from (
    select sku, ba, bb, prio, price,
    dense_rank () over (partition by sku, ba, bb order by prio) as rnk
    from factored_ranges
  )
  where rnk = 1
),

I decided to use the KEEP (DENSE_RANK FIRST...) aggregate function. I can then combine the aggregation with another analytic function in the same SELECT. I’ll show this in a minute.

5) Merge consecutive ranges with same price

Alberto uses an equivalent to the “start of group” method to merge the prices.

-- STEP will be zero if a range can be joined to the previous one, since:
-- a) they are consecutive (no gap between them)
-- b) they have the same price
-- see http://www.oracle.com/technetwork/issue-archive/o24asktom-095715.html
ranges_with_step as (
  select sku, ba, bb, prio, price,
  decode (
    price,
    lag(price) over (partition by sku order by ba),
    ba - lag(bb) over (partition by sku order by ba),
    1000
  ) step
  from strongest_factored_ranges
), 
-- the integral of step over the ranges
-- joinable ranges will hence have the same value for "interval" since step is zero
ranges_with_step_integral as (
  select sku, ba, bb, prio, price, step,
  sum(step) over (
    partition by sku order by ba
    rows between unbounded preceding and current row
  ) as integral
  from ranges_with_step
), 
-- this joins the joinable ranges
ranges_joined as (
  select * from (
    select sku, ba, bb, prio, price, step, integral,
    min(ba) over (partition by sku, integral) as a,
    max(bb) over (partition by sku, integral) as b
    from ranges_with_step_integral
  )
  where step > 0 
)
select sku, chr(a) "START", chr(b) "END", price
from ranges_joined;

I’ve waited until now to show my code, because I do steps 3), 4) and part of 5) in the same SELECT.

strongest_factored_ranges as (
  select bi.sku,
  bi.seq -
    row_number() over(partition by bi.sku order by min(i.prio), bi.seq) grp,
  bi.ba a, bi.bb b, 
  min(i.price) keep (dense_rank first order by i.prio) price
  from ranges i, base_ranges bi
  where i.sku = bi.sku
  and (i.a <= bi.ba
  and (bi.bb <= i.b or bi.bb is null and i.b is null))
  group by bi.sku, bi.seq, bi.ba, bi.bb
)
select sku, chr(min(a)) "START", chr(max(b)) "END", price
from strongest_factored_ranges
group by sku, grp, price;
  • I changed the WHERE clause to allow for NULL end points, just to show how.
  • The KEEP (DENSE_RANK FIRST...) aggregate function gives me the price of the highest priority for each base range.
  • Once the aggregation is done, I can use an analytic function, and even refer to the result of the aggregation within the analytic. Here is where I use the Tabibitosan method to assign the same grp to consecutive ranges with the same price.

Using 12c MATCH_RECOGNIZE

With 12c row pattern matching, I don’t need the Tabibitosan method so I can simplify the preceding subqueries and use fewer analytic functions. Here is the entire solution:

-- steps 1) and 2)
with base_ranges as (
  select distinct sku, i ba,
  lead(i) over(partition by sku order by i, col) bb
  from ranges
  unpivot(i for col in(a,b))
)
-- steps 3) and 4)
, strongest_factored_ranges as (
  select bi.sku,
  bi.ba a, bi.bb b, 
  min(i.price) keep (dense_rank first order by i.prio) price
  from ranges i, base_ranges bi
  where i.sku = bi.sku
  and (i.a <= bi.ba
  and (bi.bb <= i.b or bi.bb is null and i.b is null))
  group by bi.sku, bi.ba, bi.bb
)
-- step 5)
select * from strongest_factored_ranges
match_recognize(
  partition by sku order by a, b
  measures chr(first(a)) "START", chr(last(b)) "END",
    first(price) price
  pattern(A B*)
  define B as (a, price) = ((prev(b), a.price))
);

Performance

You may be a bit disappointed in my performance findings: I certainly am. The fact is, using UNPIVOT instead of UNION in step 1) is clearly better, because it does one index scan instead of two. Other than that, Alberto’s solution and my pre-12c “improvements” have equivalent performance and use about the same resources. On the other hand, MATCH_RECOGNIZE is a little slower because it uses more CPU. In conclusion, in version 11.2 I would keep Alberto’s solution (except for that first UNION) because it is easier to maintain, and I would use MATCH_RECOGNIZE in version 12c for the same reason (unless the bit of extra CPU is a problem).