Joining Temporal Tables 3: Gaps

This is my third post about temporal tables that don’t allow overlaps. I’m going to add support for temporal gaps, which will also allow objects to stop existing at some point.

Test Data

I’m going to add a NOT NULL numeric field called START_GAP, which can be 0 or 1. When START_GAP = 0, the row contains data that is valid starting at FROM_DATE. When START_GAP = 1, FROM_DATE is actually the end date of the previous row. For example, in case '01-precedes' below, table A has a row dated January 1st with START_GAP = 0 and a row dated January 2nd with START_GAP = 1: the A object exists only from January 1st to January 2nd.

The data can now contain all the relations between time intervals that I listed previously.

-- display dates as day of month only, to keep the output readable
alter session set nls_date_format='dd';
drop table a purge;

create table a (
  case_id varchar2(32),
  from_date date,
  start_gap number(1,0) not null check(start_gap in (0,1)),
  a_string varchar2(32),
  primary key(case_id, from_date)
);
Insert into A values ('01-precedes',to_date('2016-01-01','yyyy-mm-dd'),0,'A01-precedes');
Insert into A values ('01-precedes',to_date('2016-01-02','yyyy-mm-dd'),1,'A01-precedes');
Insert into A values ('02-meets',to_date('2016-01-01','yyyy-mm-dd'),0,'A02-meets');
Insert into A values ('02-meets',to_date('2016-01-02','yyyy-mm-dd'),1,'A02-meets');
Insert into A values ('03-overlaps',to_date('2016-01-01','yyyy-mm-dd'),0,'A03-overlaps');
Insert into A values ('03-overlaps',to_date('2016-01-03','yyyy-mm-dd'),1,'A03-overlaps');
Insert into A values ('04-finished by',to_date('2016-01-01','yyyy-mm-dd'),0,'A04-finished by');
Insert into A values ('04-finished by',to_date('2016-01-03','yyyy-mm-dd'),1,'A04-finished by');
Insert into A values ('05-contains',to_date('2016-01-01','yyyy-mm-dd'),0,'A05-contains');
Insert into A values ('05-contains',to_date('2016-01-04','yyyy-mm-dd'),1,'A05-contains');
Insert into A values ('06-starts',to_date('2016-01-01','yyyy-mm-dd'),0,'A06-starts');
Insert into A values ('06-starts',to_date('2016-01-02','yyyy-mm-dd'),1,'A06-starts');
Insert into A values ('07-equals',to_date('2016-01-01','yyyy-mm-dd'),0,'A07-equals');
Insert into A values ('07-equals',to_date('2016-01-02','yyyy-mm-dd'),1,'A07-equals');

drop table b purge;

create table b (
  case_id varchar2(32),
  from_date date,
  start_gap number(1,0) not null check(start_gap in (0,1)),
  b_string varchar2(32),
  primary key(case_id, from_date)
);

Insert into B values ('01-precedes',to_date('2016-01-03','yyyy-mm-dd'),0,'B01-precedes');
Insert into B values ('01-precedes',to_date('2016-01-04','yyyy-mm-dd'),1,'B01-precedes');
Insert into B values ('02-meets',to_date('2016-01-02','yyyy-mm-dd'),0,'B02-meets');
Insert into B values ('02-meets',to_date('2016-01-03','yyyy-mm-dd'),1,'B02-meets');
Insert into B values ('03-overlaps',to_date('2016-01-02','yyyy-mm-dd'),0,'B03-overlaps');
Insert into B values ('03-overlaps',to_date('2016-01-04','yyyy-mm-dd'),1,'B03-overlaps');
Insert into B values ('04-finished by',to_date('2016-01-02','yyyy-mm-dd'),0,'B04-finished by');
Insert into B values ('04-finished by',to_date('2016-01-03','yyyy-mm-dd'),1,'B04-finished by');
Insert into B values ('05-contains',to_date('2016-01-02','yyyy-mm-dd'),0,'B05-contains');
Insert into B values ('05-contains',to_date('2016-01-03','yyyy-mm-dd'),1,'B05-contains');
Insert into B values ('06-starts',to_date('2016-01-01','yyyy-mm-dd'),0,'B06-starts');
Insert into B values ('06-starts',to_date('2016-01-03','yyyy-mm-dd'),1,'B06-starts');
Insert into B values ('07-equals',to_date('2016-01-01','yyyy-mm-dd'),0,'B07-equals');
Insert into B values ('07-equals',to_date('2016-01-02','yyyy-mm-dd'),1,'B07-equals');
commit;

select * from a
union all
select * from b
order by 1,2,4;
CASE_ID FROM_DATE START_GAP A_STRING or B_STRING
01-precedes 01 0 A01-precedes
01-precedes 02 1 A01-precedes
01-precedes 03 0 B01-precedes
01-precedes 04 1 B01-precedes
02-meets 01 0 A02-meets
02-meets 02 1 A02-meets
02-meets 02 0 B02-meets
02-meets 03 1 B02-meets
03-overlaps 01 0 A03-overlaps
03-overlaps 02 0 B03-overlaps
03-overlaps 03 1 A03-overlaps
03-overlaps 04 1 B03-overlaps
04-finished by 01 0 A04-finished by
04-finished by 02 0 B04-finished by
04-finished by 03 1 A04-finished by
04-finished by 03 1 B04-finished by
05-contains 01 0 A05-contains
05-contains 02 0 B05-contains
05-contains 03 1 B05-contains
05-contains 04 1 A05-contains
06-starts 01 0 A06-starts
06-starts 01 0 B06-starts
06-starts 02 1 A06-starts
06-starts 03 1 B06-starts
07-equals 01 0 A07-equals
07-equals 01 0 B07-equals
07-equals 02 1 A07-equals
07-equals 02 1 B07-equals


What’s the idea?

As before, we need to carry down values and eliminate certain rows. How do we manage that for the new rows where START_GAP = 1?

  1. For table A, if the most recent START_GAP is 1 then A_STRING should be NULL. Same for B.
  2. If the most recent START_GAP is 1 for both tables, then that row should be eliminated.
  3. If the first row is from A, then B.START_GAP will be NULL. In that case B.START_GAP should be treated as 1, since the B object did not exist at that point. (The same goes for A.START_GAP when the first row is from B.)
  4. As before, TO_DATE must be NULL or greater than FROM_DATE, otherwise the row is eliminated.
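
Before looking at the solutions, here’s a quick way to see what these rules have to deal with: the inner UNION ALL from the solutions below, restricted to one case. For '03-overlaps', the most recent A gap flag becomes 1 on the 3rd (so A_STRING must disappear), and on the 4th both flags are 1 (so the whole row must be eliminated):

select * from (
  select case_id, from_date, start_gap agap, a_string,
    null bgap, null b_string
  from a
  union all
  select case_id, from_date, null, null, start_gap bgap, b_string
  from b
)
where case_id = '03-overlaps'
order by from_date;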

The MATCH_RECOGNIZE solution

Note that I don’t create a ROWTYPE column anymore: since START_GAP is defined as NOT NULL in both tables, the AGAP column in the combined data is non-NULL exactly when the current row comes from table A.

select case_id, from_date, to_date, a_str a_string, b_str b_string
from (
  select case_id, from_date, start_gap agap, a_string,
    null bgap, null b_string
  from a
  union all
  select case_id, from_date, null, null, start_gap bgap, b_string
  from b
)
match_recognize(
  partition by case_id order by from_date
  measures next(from_date) to_date,
    -- a.agap here means last(a.agap): the most recent gap flag from table A.
    -- nvl() treats "no row from this table yet" as a gap (rule 3);
    -- abgap = 2 means both objects are in a gap: eliminated below (rule 2)
    nvl(a.agap,1) + nvl(b.bgap,1) abgap,
    -- rule 1: show a string only if the most recent start_gap is 0
    case a.agap when 0 then a.a_string end a_str,
    case b.bgap when 0 then b.b_string end b_str
  all rows per match
  pattern( (a|b)+ )
  define a as agap is not null  -- rows from table A; all other rows are b
)
where (to_date > from_date or to_date is null)
  and abgap < 2;
CASE_ID FROM_DATE TO_DATE A_STRING B_STRING
01-precedes 01 02 A01-precedes
01-precedes 03 04 B01-precedes
02-meets 01 02 A02-meets
02-meets 02 03 B02-meets
03-overlaps 01 02 A03-overlaps
03-overlaps 02 03 A03-overlaps B03-overlaps
03-overlaps 03 04 B03-overlaps
04-finished by 01 02 A04-finished by
04-finished by 02 03 A04-finished by B04-finished by
05-contains 01 02 A05-contains
05-contains 02 03 A05-contains B05-contains
05-contains 03 04 A05-contains
06-starts 01 02 A06-starts B06-starts
06-starts 02 03 B06-starts
07-equals 01 02 A07-equals B07-equals


The Analytic solution

This one took me some time to figure out. I broke it down into 3 logical steps using WITH clauses.

  • ALL_ROWS just does the UNION ALL, while transforming NULL attribute values to CHR(0)
  • ANALYZED_ROWS applies all the analytic functions
  • The main SELECT transforms CHR(0) back to NULL, makes a value NULL if the most recent START_GAP = 1, and eliminates excess rows.
with all_rows as (
  select case_id, from_date, start_gap agap, nvl(a_string, chr(0)) a_string,
    null bgap, null b_string
  from a
  union all
  select case_id, from_date, null, null,
    start_gap bgap, nvl(b_string, chr(0)) b_string
  from b
)
, analyzed_rows as (
  select case_id, from_date,
    lead(from_date) over(partition by case_id order by from_date) to_date,
    last_value(agap) ignore nulls
      over(partition by case_id order by from_date) agap,
    last_value(bgap) ignore nulls
      over(partition by case_id order by from_date) bgap,
    last_value(a_string) ignore nulls
      over(partition by case_id order by from_date) a_string,
    last_value(b_string) ignore nulls
      over(partition by case_id order by from_date) b_string
  from all_rows
)
select case_id, from_date, to_date,
  case when agap = 0 then nullif(a_string, chr(0)) end a_string,
  case when bgap = 0 then nullif(b_string, chr(0)) end b_string
from analyzed_rows
where (to_date > from_date or to_date is null)
  and nvl(agap,1) + nvl(bgap,1) < 2;
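
Why the CHR(0) trick? LAST_VALUE … IGNORE NULLS cannot tell “no row from this table yet” apart from “the attribute is genuinely NULL”, so a real NULL would be overwritten by an older value during the carry-down. A minimal sketch of the problem, on hypothetical data rather than the tables above:

select seq, val,
       last_value(val) ignore nulls over(order by seq) carried
from (
  select 1 seq, 'X' val from dual
  union all
  select 2, null from dual  -- the attribute is genuinely NULL here
);
-- row 2 shows CARRIED = 'X', which is wrong for our purposes;
-- nvl(val, chr(0)) before the carry-down and nullif(..., chr(0))
-- afterwards preserve the real NULL.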

There’s more?

Yes indeed, I’m not through yet.

At this point I can define temporal tables that allow gaps but not overlaps. The attributes can be NULL but FROM_DATE can’t. The TO_DATE is not stored in the same row, but it can be displayed that way. I can also “join” two such tables in an efficient manner.

I only have one problem: your tables probably aren’t built that way. You have FROM_DATE and TO_DATE in each row, and if you tell me your ranges don’t overlap, I’ll just have to take your word for it.

In my next post I’ll try to “join” tables that store FROM_DATE and TO_DATE, assuming there are no overlaps.


Summarize Data by Range

At the UKOUG Tech15 conference, Jonathan Lewis presented a question asked on OTN and the answer published on his blog. The problem is how to count the number of rows in one table that fall into the ranges stored in another table.

The straightforward solution in SQL is to join the two tables using a BETWEEN condition, but that involves comparing each row in one table to every row in the other table. With the questioner’s data, this solution takes seven hours.
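
For reference, the straightforward join looks something like this (a sketch, using the table and column names defined in the next section):

select r.from_number, r.to_number, count(m.msisdn) range_count
from number_ranges r
left join msisdns m
  on m.msisdn between r.from_number and r.to_number
group by r.from_number, r.to_number
order by r.from_number;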

Jonathan’s solution requires more than one SQL statement, but it can answer the question in a few minutes. Please refer to his blog for details about the problem statement and his solution.

If you have database version 12c or later, the MATCH_RECOGNIZE clause can help solve this problem directly. I’ll present two variants:

  • the first assumes that the ranges cannot overlap,
  • and the second handles any ranges, including overlaps.

Ranges without overlaps

Here are some test data using Jonathan’s table and column names:

define d_sqrt_rows = 10

create table msisdns as
with data as (
  select null from dual
  connect by level <= &d_sqrt_rows
)
select rownum msisdn
from data, data
where rownum < power(&d_sqrt_rows,2);  -- strictly less than 100, so only 99 rows

create table number_ranges as
select level + &d_sqrt_rows * (level - 1) from_number,
&d_sqrt_rows * level to_number
from dual
connect by level <= &d_sqrt_rows;

select * from number_ranges;
FROM_NUMBER TO_NUMBER
1 10
12 20
23 30
34 40
45 50
56 60
67 70
78 80
89 90
100 100


I start by combining the ranges and the individual rows in one dataset:

select from_number, to_number from number_ranges
union all
select msisdn, null from msisdns

Using the MATCH_RECOGNIZE clause, I can:

  • Order the data by from_number
  • then process the data in that order:
    • Assign the label “A” to the first range
    • Assign the label “B” to all the following individual rows that fall within the range
    • Return one row with the range from “A” and the count of “B” rows
    • Then go on to the next range, skipping any individual rows that fall after the first range and before the second.
select * from (
  select from_number, to_number from number_ranges
  union all
  select msisdn, null from msisdns
)
match_recognize(
  order by from_number, to_number  -- NULLs sort last: a range sorts before individual rows with the same value
  measures a.from_number from_number,
           a.to_number to_number,
           count(b.*) range_count
  pattern(a b*)
  define a as to_number is not null,
         b as from_number <= a.to_number
);
FROM_NUMBER TO_NUMBER RANGE_COUNT
1 10 10
12 20 9
23 30 8
34 40 7
45 50 6
56 60 5
67 70 4
78 80 3
89 90 2
100 100 0


Yes, the last RANGE_COUNT should be 0 because I only loaded the first 99 individual rows.

Ranges with overlaps

To solve this, I need to use MATCH_RECOGNIZE twice! This is fun for me, though it may not be for you…

First, I’ll update the number_ranges table to create overlaps:

update number_ranges set to_number = to_number+10;

select * from number_ranges;
FROM_NUMBER TO_NUMBER
1 20
12 30
23 40
34 50
45 60
56 70
67 80
78 90
89 100
100 110


Overlapping ranges require us to count some individual rows more than once, if they show up in more than one range. One way to handle this is to break down the ranges into what I call “base ranges” every time there is an intersection. For example, for ranges 1-4 and 3-6:

  • the “base ranges” are 1-2, 3-4 and 5-6
  • and the rows in range 3-4 must be counted both in range 1-4 and in range 3-6.

I identify the “base ranges” by putting both “from” and “to” numbers in the first column so I can sort them together:

select from_number num, 0 rowtype from number_ranges
union all
select msisdn, 1 from msisdns
union all
select to_number, 2 rowtype from number_ranges

Once I order by from_number and rowtype, all the individual rows will be sandwiched between “from” or “to” numbers having a rowtype other than 1. So, I just return one row per “sandwich”. Notice that after each match I start from the last row of the previous match (“after match skip to last a”): the bottom slice of one sandwich becomes the top slice of the next sandwich.

with ranges_plus_data as (
  select from_number num, 0 rowtype from number_ranges
  union all
  select msisdn, 1 from msisdns
  union all
  select to_number, 2 rowtype from number_ranges
)
select * from ranges_plus_data
match_recognize(
  order by num, rowtype 
  measures first(num) from_number,
           last(num) to_number,
           count(b.*) range_count
  after match skip to last a
  pattern(a b+ a)
  define a as rowtype !=1,
         b as rowtype = 1
);
FROM_NUMBER TO_NUMBER RANGE_COUNT
1 12 11
12 20 9
20 23 2
23 30 8
30 34 3
34 40 7
40 45 4
45 50 6
50 56 5
56 60 5
60 67 6
67 70 4
70 78 7
78 80 3
80 89 8
89 90 2
90 100 9


Now I need to match these base ranges with each original range.

  • I combine original ranges and base ranges, then sort them.
  • Then I start from each original range and sum all the range_counts of the included base ranges.
  • To make sure I don’t skip an original range, I start looking for the next match one row after the beginning of the previous match (“after match skip to next row”).
with ranges_plus_data as (
  select from_number num, 0 rowtype from number_ranges
  union all
  select msisdn, 1 from msisdns
  union all
  select to_number, 2 rowtype from number_ranges
)
, base_range_sums as (
  select * from ranges_plus_data
  match_recognize(
    order by num, rowtype 
    measures first(num) from_number,
             last(num) to_number,
             count(b.*) range_count
    after match skip to last a
    pattern(a b+ a)
    define a as rowtype !=1,
           b as rowtype = 1
  )
)
select from_number, to_number, range_count
from (
  select * from base_range_sums
  union all
  select from_number, to_number, 0 from number_ranges  -- the original ranges, with count 0: these anchor each match
)
match_recognize(
  order by from_number, range_count, to_number  -- count 0 sorts each original range before its base ranges
  measures first(from_number) from_number,
           first(to_number) to_number,
           sum(range_count) range_count
  after match skip to next row
  pattern(a (a|b)*)
  define a as range_count = 0,
         b as to_number <= first(to_number)
);
FROM_NUMBER TO_NUMBER RANGE_COUNT
1 20 20
12 30 19
23 40 18
34 50 17
45 60 16
56 70 15
67 80 14
78 90 13
89 100 11
100 110 0

With 40 million rows and 1000 ranges, Jonathan’s solution ran in less than two minutes. This solution runs in a little over a minute; it is not necessarily “better” but it is a reasonable alternative for a shop with 12c and a SQL developer who is familiar with MATCH_RECOGNIZE.

MATCH_RECOGNIZE: gaps in date ranges

To find gaps between date ranges, there is an elegant solution using analytic functions. We can follow the same logic with MATCH_RECOGNIZE, using a technique that avoids the problems with MIN() and MAX() comparisons I just wrote about.

Here are some simple test data

create table t ( start_date date, end_date date );
Insert into T values (DATE '2007-01-01', DATE '2007-01-15');
Insert into T values (DATE '2007-01-03', DATE '2007-01-10');
Insert into T values (DATE '2007-01-12', DATE '2007-01-25');
Insert into T values (DATE '2007-01-20', DATE '2007-02-01');
Insert into T values (DATE '2007-02-05', DATE '2007-02-10');
Insert into T values (DATE '2007-02-05', DATE '2007-02-28');
Insert into T values (DATE '2007-02-10', DATE '2007-02-15');
Insert into T values (DATE '2007-03-01', DATE '2007-03-02');
Insert into T values (DATE '2007-03-03', DATE '2007-03-16');

Please note that I use exclusive end dates: the second-to-last range ends when March 2nd starts, so there should be a gap from March 2nd to March 3rd.

Using analytics, I compare the maximum end date so far to the next start date. A running MAX is needed (rather than just the previous end date) because ranges can nest: in the data above, the first range ends after the second one does.

SELECT * FROM (
  SELECT MAX(end_date) OVER (ORDER BY start_date) start_gap,
  LEAD(start_date) OVER (ORDER BY start_date) end_gap
  FROM T
)
WHERE start_gap < end_gap;
START_GAP END_GAP
2007-02-01 2007-02-05
2007-02-28 2007-03-01
2007-03-02 2007-03-03

That is so simple – once you think of it.

Looking for a MATCH_RECOGNIZE solution, I tried to compare the current start date to the maximum prior end date. I ran into the problem I wrote about previously: when you use MAX() or other aggregates in a comparison, the current row is included.

I finally realized that if I compared the maximum end date to the next start date, the problem disappeared.

select start_gap, end_gap from t
match_recognize(
  order by start_date
  measures max(end_date) start_gap, next(start_date) end_gap
  all rows per match
  pattern((A|{-B-})+)
  define A as max(end_date) < next(start_date)
);

For those less familiar with the syntax:

  • Every row that has a gap with the following row gets assigned to “A”, and all the other rows get assigned to “B”.
  • Normally, all rows per match will output every row, but {-b-} excludes the “B” rows from the output.

This code is logically equivalent to the analytic solution above. It also demonstrates how to avoid the knotty problem of how to exclude the current row from aggregates:

  1. Instead of comparing the aggregate to the current row, just back up a row.
  2. Now the aggregate contains exactly the rows you want.
  3. So compare that aggregate to the value of the row you care about, which is the “next” row.
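
Using the columns from the query above, the recipe boils down to this contrast (the first form is the one that does not work):

-- wrong: max(end_date) includes the current row, so the comparison
-- effectively pits the current row against itself
define A as max(end_date) < start_date

-- right: compare the aggregate to the NEXT row's start date instead
define A as max(end_date) < next(start_date)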

MATCH_RECOGNIZE: matching based on aggregates

The heart of row pattern matching is finding which row matches what part of the pattern. Within the 12c MATCH_RECOGNIZE clause, DEFINE lists the conditions a row may meet; it doesn’t always work the way you expect, especially if you use aggregates in the condition.

How DEFINE works

The PATTERN clause lists the conditions that have to be met by a series of rows in order to match the pattern. The DEFINE clause defines each condition. For example:

PATTERN (A B)
DEFINE
  A as A.SOME_COLUMN = 'First row',
  B as A.SOME_COLUMN = 'First row' and B.SOME_COLUMN = 'Second row'

Do you see something strange here?

  • In the “A” condition, A.SOME_COLUMN refers to the row being evaluated.
  • In the “B” condition, A.SOME_COLUMN refers to a row that has already been accepted as meeting the “A” condition. B.SOME_COLUMN refers to the row being evaluated.

DEFINE starts by assuming the current row meets the condition. Then it evaluates the condition. If TRUE, the current row gets assigned. If FALSE, the current row does not get assigned to that condition.
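
Here’s a minimal runnable illustration of that: while the “B” condition is being evaluated, A.SOME_COLUMN refers back to the row already assigned to “A” (the data is hypothetical):

select * from (
  select 1 seq, 'First row' some_column from dual
  union all
  select 2, 'Second row' from dual
)
match_recognize(
  order by seq
  measures a.seq a_seq, b.seq b_seq
  pattern(a b)
  define a as a.some_column = 'First row',
         b as a.some_column = 'First row'
           and b.some_column = 'Second row'
);
-- one match: A_SEQ = 1, B_SEQ = 2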

Conditions with Aggregates

When you use an aggregate in DEFINE, you have to know when the aggregate will include the current row being evaluated. Here is an example:

PATTERN (A B)
DEFINE
  A as count(B.*) = 0,
  B as count(B.*) = 0

When the first row is evaluated, there are no B rows yet, and the current row is assumed to be part of “A”, so the “A” condition will be met. When the “B” condition is evaluated, the current row is assumed to be part of “B”, so count(B.*) will always be greater than 0 and the “B” condition will never be met.
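
A minimal sketch of this behavior (hypothetical data; I make “B” optional so the query still returns rows). Every row comes back labeled “A”, because whenever the “B” condition is evaluated, the current row is provisionally mapped to “B” and count(B.*) is already 1:

select * from (select level n from dual connect by level <= 3)
match_recognize(
  order by n
  measures classifier() which
  all rows per match
  pattern(a b?)
  define a as count(b.*) = 0,
         b as count(b.*) = 0
);
-- WHICH = 'A' on all three rows; the 'B' condition is never met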

Excluding the current row

If you use an aggregate that includes the current row’s value, and you want that row excluded, you have to do it yourself, as sketched after this list:

  • For COUNT(), subtract 1
  • For SUM(), subtract the value of the current row.
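
In DEFINE terms, a sketch with hypothetical column names:

-- running count excluding the current row
define a as count(*) - 1 < 3
-- running sum excluding the current row
define a as sum(amount) - amount <= 100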

What about MIN() or MAX()?

I hope I haven’t lost you already, because the hard part is coming up. It’s also the interesting part, so take a deep breath and read on.

You can’t just “exclude” the current row from a MIN() or MAX() function. You have to keep track of which row has the minimum or maximum value, and then refer to that row. Here’s an example.

Let’s say I have stock market summaries each day: for each stock, I get the date, the closing price and the quantity traded.

create table ticker (
  stock_id integer,
  trade_date date,
  closing_price number not null,
  qty integer not null,
  primary key(stock_id, trade_date)
);
insert into ticker select 
1, sysdate, 20, 100 from dual union all select
1, sysdate+1, 30, 90 from dual union all select
1, sysdate+2, 15, 200 from dual union all select
1, sysdate+3, 14, 80 from dual union all select
1, sysdate+4, 20, 200 from dual;
STOCK_ID TRADE_DATE CLOSING_PRICE QTY
1 2015-07-02 07:23:37 20 100
1 2015-07-03 07:23:37 30 90
1 2015-07-04 07:23:37 15 200
1 2015-07-05 07:23:37 14 80
1 2015-07-06 07:23:37 20 200

I want to know when the stock hits a new low in price, or quantity, or both. To do this, I need to compare the current row’s price and quantity with the minimums up to but not including the current row.

I do this by defining four conditions:

  • PQ: a row containing a new low price and a new low quantity;
  • P: row with a new low price but not a new low quantity;
  • Q: row with a new low quantity but not a new low price;
  • OK: row without any new lows.

I also define two SUBSETs:

  • LOWP includes PQ and P rows: the most recent LOWP row has the minimum price so far.
  • LOWQ includes PQ and Q rows: the most recent LOWQ row has the minimum quantity so far.

Now, how do I access the row just before the current one? If I compare the current price to LAST(LOWP.closing_price), I will be comparing the current price to itself, since the row being evaluated is provisionally included in the subset.

To get the prior row, I just say LAST(LOWP.closing_price,1). The optional second argument backs up exactly one row within the LOWP rows.

select * from ticker
match_recognize(
  partition by stock_id order by trade_date
  measures classifier() new_low
  all rows per match
  pattern(pq (pq|p|q|ok)*)
  subset lowp = (pq,p), lowq = (pq,q)
  define
    -- count(*) = 1: the first row of each match sets both new lows by definition
    pq as count(*) = 1 or (
      closing_price < last(lowp.closing_price,1) and qty < last(lowq.qty,1)
    ),
    p as closing_price < last(lowp.closing_price,1),
    q as qty < last(lowq.qty,1)
);
STOCK_ID TRADE_DATE NEW_LOW CLOSING_PRICE QTY
1 2015-07-02 07:23:37 PQ 20 100
1 2015-07-03 07:23:37 Q 30 90
1 2015-07-04 07:23:37 P 15 200
1 2015-07-05 07:23:37 PQ 14 80
1 2015-07-06 07:23:37 OK 20 200


OpenWorld Presentation on Row Pattern Matching

You can download my presentation from

http://www.slideshare.net/stewashton/row-patternmatching12coow14

Please download instead of viewing: the slideshare viewer doesn’t show the animations I worked so hard on.

I have already blogged about some of the topics in my presentation. Sometime soon I will continue with in-depth looks at other issues, such as “catastrophic backtracking”.

To those many who missed it, I will be making the same presentation at UKOUG Tech 14 in December.

Rob van Wijk and ad hoc Grouping

In March 2014, I wrote about two methods for ad hoc grouping: “Grouping Sequences” and “Start of Group”. I just found out that Rob van Wijk wrote on the same subjects two months earlier. We even used the same example!

I mention Rob’s article for two reasons:

  • If my explanation wasn’t clear enough, you can try Rob’s.
  • When I use “Start of Group”, Rob uses a similar method he calls “max-on-case-row-number”. These techniques are completely equivalent. Execution plans and performance are identical.

The “Start of Group” method can take many forms. Don’t think you need to know every one! Just choose the form you understand and like best, and use it all the time.

Don’t forget, when you get 12c you can replace both methods by the MATCH_RECOGNIZE clause :)
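
As a taste, here’s a sketch of consecutive grouping with MATCH_RECOGNIZE, assuming a hypothetical table T(SEQ, VAL) where consecutive rows with equal VAL form a group:

select seq, val, grp
from t
match_recognize(
  order by seq
  measures match_number() grp
  all rows per match
  pattern(strt same*)
  define same as val = prev(val)  -- STRT is undefined, so it matches any row
);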

Bin Fitting problems with SQL

“Bin Fitting” or “Bin Packing” means putting the greatest quantity in the smallest number of “bins” or containers. There are some good solutions with the MODEL clause, but the most concise and efficient solutions use the new 12c MATCH_RECOGNIZE clause.

There are two categories of “bin fitting” problems:

  1. The number of bins is fixed and the capacity can vary.
    > We want to equalize the quantities we put into each bin.
  2. The bins have a fixed capacity and the number of bins can vary.
    > We want to minimize the number of bins.

I’ll show an example of each.

Fill a fixed number of bins as equally as possible

Wikipedia calls this the “Multiprocessor scheduling problem”. Brendan Furey, author of a brilliant MODEL solution, calls it “balanced number partitioning”. Let’s take a simple, reproducible example:

CREATE TABLE items AS
SELECT LEVEL item_name, level item_value 
from dual
connect by level <= 10;

The total value of all the items is 55. We want to split the items into 3 “bins” or “buckets” or groups, so that the sums of the values are as equal as possible. Here the best we can do is put 19 in one bin and 18 in each of the other bins.
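
A quick check of the arithmetic:

select sum(item_value) total from items;  -- 55 = 10 + 9 + ... + 1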

There are several algorithms that solve this problem more or less accurately. The one I’m going to show is fairly simple:

  • First, order the items by value in descending order
  • Then, assign each item to whatever bin has the smallest sum so far.

With the ten items above: 10, 9 and 8 go to bins 1, 2 and 3; then 7 goes to bin 3 (the smallest sum so far), 6 to bin 2, 5 to bin 1, and so on.

Here is Brendan’s MODEL implementation:

SELECT item_name, bin, item_value,
Max (bin_value) OVER (PARTITION BY bin) bin_value
FROM (
  SELECT * FROM items
  MODEL 
  DIMENSION BY (Row_Number() OVER (ORDER BY item_value DESC) rn)
  MEASURES (
    item_name, 
    item_value,
    Row_Number() OVER (ORDER BY item_value DESC) bin,
    item_value bin_value,
    Row_Number() OVER (ORDER BY item_value DESC) rn_m,
    0 min_bin,
    Count(*) OVER () - :N_BINS - 1 n_iters
  )
  RULES ITERATE(100000) UNTIL (ITERATION_NUMBER >= n_iters[1]) (
    min_bin[1] =
      Min(rn_m) KEEP (DENSE_RANK FIRST ORDER BY bin_value)[rn <= :N_BINS],
    bin[ITERATION_NUMBER + :N_BINS + 1] = 
      min_bin[1],
    bin_value[min_bin[1]] =
      bin_value[CV()] + Nvl (item_value[ITERATION_NUMBER + :N_BINS + 1], 0)
  )
)
WHERE item_name IS NOT NULL
ORDER BY item_value DESC;
ITEM_NAME BIN ITEM_VALUE BIN_VALUE
10 1 10 19
9 2 9 18
8 3 8 18
7 3 7 18
6 2 6 18
5 1 5 19
4 1 4 19
3 2 3 18
2 3 2 18
1 3 1 18

Please refer to Brendan’s post for a full explanation, but notice the number of “bins” is actually a bind variable!

Unfortunately, this is not a very scalable solution: with 10,000 records, this query runs in about 30 seconds. My implementation with the 12c MATCH_RECOGNIZE clause has to hard-code the number of buckets (the pattern variables are fixed in the PATTERN clause), but it does 10,000 records in 0.2 seconds!

SELECT * from  items
MATCH_RECOGNIZE (
  ORDER BY item_value desc
  MEASURES to_number(substr(classifier(),4)) bin#,
    sum(bin1.item_value) bin1_cum_val,
    sum(bin2.item_value) bin2_cum_val,
    sum(bin3.item_value) bin3_cum_val
  ALL ROWS PER MATCH
  PATTERN ((bin1|bin2|bin3)*)
  DEFINE bin1 AS count(bin1.*) = 1
    OR sum(bin1.item_value)-bin1.item_value
        <= least(sum(bin2.item_value), sum(bin3.item_value)),
    bin2 AS count(bin2.*) = 1
      OR sum(bin2.item_value)-bin2.item_value
        <= sum(bin3.item_value)
);
ITEM_NAME ITEM_VALUE BIN# BIN1_CUM_VAL BIN2_CUM_VAL BIN3_CUM_VAL
10 10 1 10
9 9 2 10 9
8 8 3 10 9 8
7 7 3 10 9 15
6 6 2 10 15 15
5 5 1 15 15 15
4 4 1 19 15 15
3 3 2 19 18 15
2 2 3 19 18 17
1 1 3 19 18 18

I’ve already mentioned the basic syntax of this clause in previous posts, but look at the PATTERN line. The vertical pipe means “OR”: I want a line that matches either “bin1”, “bin2” or “bin3”, and the first match wins.

The DEFINE clause takes some getting used to. The way it works is: you assume that the current row matches the first condition and you test the condition; if the condition is true, then there really is a match and you move on to the next row. Otherwise you test the second condition, and so on until the last. In this case, I don’t have to define BIN3 at all: it’s always true if none of the previous conditions match.

The BIN1 condition starts with count(bin1.*) = 1: this can only be true if no previous row has been assigned to BIN1. The other part of the condition compares the sums of BIN1, BIN2 and BIN3. If BIN1 has the smallest amount, it gets the current row. Now here’s the tricky part: if I just say sum(bin1.item_value), then that sum will include the row I am testing. What I really want is the sum so far, not including the row I am testing, so I have to subtract that value.

The BIN2 condition is simpler: since the BIN1 condition has already failed, the smallest bin must be either BIN2 or BIN3, so I only need to compare those two.

Bins have a fixed capacity: use as few bins as possible

My first post on the MATCH_RECOGNIZE clause showed how to calculate running totals that start over before the total exceeds a certain value. This is a simple version of the “fixed capacity” problem.

I’m going to complicate things a bit by letting each bin have an elastic capacity. The normal capacity is 50, but in some cases I can go over the limit of 50, as long as I never go over 55.

  • If the current bin already has 50 or more, it is full and cannot receive a new item.
  • If the current bin has less than 50, I can add a new item as long as the new total is 55 or less.
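
These two rules translate directly into the DEFINE condition of the final query further down; sum(num_units) includes the current row, so subtracting num_units gives the bin’s total before the current item:

define A as sum(num_units) - num_units < 50  -- the bin had less than 50 before this item
        and sum(num_units) <= 55             -- and the new total stays within 55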

I admit that I was hoping to solve this problem by using more advanced syntax in the PATTERN clause, such as “reluctant quantifiers”. I tried 4 or 5 different solutions and they all had bugs. The only solution that seems to work uses a simple PATTERN and a slightly more complex DEFINE. Anyway, here’s the code to create a test table with different boundary conditions.

drop table t cascade constraints purge;
create table t(cat_id number, item_id number, num_units number, incr number);

insert into t
select rownum cat_id, rownum item_id, num_units, incr from (
  select level incr from dual
  connect by level <= 6
) a
, lateral(
  select 51 - level num_units from dual
  connect by level <= a.incr+1
);
insert into t
select cat_id, rownum+100 item_id, incr num_units, incr from t
, lateral(
  select null n from dual
  connect by level <= (56 - t.num_units)/t.incr+1
);
insert into t values(100, 200, 41, 1);

select cat_id, item_id, num_units,
incr "Increment"
from t
where num_units > 40
order by cat_id, item_id;
CAT_ID ITEM_ID NUM_UNITS Increment
1 1 50 1
2 2 49 1
3 3 50 2
4 4 49 2
5 5 48 2
6 6 50 3
7 7 49 3
8 8 48 3
9 9 47 3
10 10 50 4
11 11 49 4
12 12 48 4
13 13 47 4
14 14 46 4
15 15 50 5
16 16 49 5
17 17 48 5
18 18 47 5
19 19 46 5
20 20 45 5
21 21 50 6
22 22 49 6
23 23 48 6
24 24 47 6
25 25 46 6
26 26 45 6
27 27 44 6
100 200 41 1

Each category is a different test case. The first item in each category has a value between 44 and 50. The following items all have the same values, indicated in the “Increment” column. I add a last case at the end where I don’t reach the limit of 50.

Notice the use of the LATERAL clause, which is now documented in 12c. It’s a lot simpler than the old TABLE(CAST(MULTISET syntax. Now here is the solution; I only show the first bin in each category, plus the first item in the second bin. I’ve got nothing more to say, so don’t scroll to the bottom if you don’t feel like it :)

select cat_id, bin#, sum_units, num_units, item_id
from t
match_recognize(
  partition by cat_id
  order by item_id
  measures final sum(num_units) sum_units,  -- FINAL semantics: every row shows the total for its bin
  match_number() bin#, count(*) cnt
  all rows per match
  pattern(A+)
  define A as sum(num_units) - num_units < 50
    and sum(num_units) <= 55
)
where bin# = 1 or cnt = 1;  -- the first bin in each category, plus the first item of the second bin
CAT_ID BIN# SUM_UNITS NUM_UNITS ITEM_ID
1 1 50 50 1
1 2 7 1 101
2 1 50 49 2
2 1 50 1 108
2 2 7 1 109
3 1 50 50 3
3 2 8 2 116
4 1 51 49 4
4 1 51 2 120
4 2 6 2 121
5 1 50 48 5
5 1 50 2 124
5 2 8 2 125
6 1 50 50 6
6 2 9 3 129
7 1 52 49 7
7 1 52 3 132
7 2 6 3 133
8 1 51 48 8
8 1 51 3 135
8 2 6 3 136
9 1 50 47 9
9 1 50 3 138
9 2 9 3 139
10 1 50 50 10
10 2 8 4 142
11 1 53 49 11
11 1 53 4 144
11 2 4 4 145
12 1 52 48 12
12 1 52 4 146
12 2 8 4 147
13 1 51 47 13
13 1 51 4 149
13 2 8 4 150
14 1 50 46 14
14 1 50 4 152
14 2 8 4 153
15 1 50 50 15
15 2 10 5 155
16 1 54 49 16
16 1 54 5 157
16 2 5 5 158
17 1 53 48 17
17 1 53 5 159
17 2 5 5 160
18 1 52 47 18
18 1 52 5 161
18 2 5 5 162
19 1 51 46 19
19 1 51 5 163
19 2 10 5 164
20 1 50 45 20
20 1 50 5 166
20 2 10 5 167
21 1 50 50 21
21 2 12 6 169
22 1 55 49 22
22 1 55 6 171
22 2 6 6 172
23 1 54 48 23
23 1 54 6 173
23 2 6 6 174
24 1 53 47 24
24 1 53 6 175
24 2 6 6 176
25 1 52 46 25
25 1 52 6 177
25 2 6 6 178
26 1 51 45 26
26 1 51 6 179
26 2 6 6 180
27 1 50 44 27
27 1 50 6 181
27 2 12 6 182
100 1 41 41 200