Date Ranges in Data Warehouses using Oracle 12c

When you load data with an “effective date” into a Data Warehouse, what happens when the new data is the same as yesterday’s? Do you add a new record with the new “effective date” or do you update a date range in the existing record?

At Open World last month, I got both answers: Tim Gorman presented an efficient method for loading “effective dates” and Dr. Holger Friedrich presented a way to maintain date ranges.

I mentioned to Dr. Friedrich that the 12c MATCH_RECOGNIZE clause could maintain those date ranges more efficiently. By the time I actually wrote the SQL, Dr. Friedrich had written practically the same thing. Here is my variant, inspired by Dr. Friedrich’s test case.

The Test Data

I have three tables:

  • ALL_DATA is the Data Warehouse. Each row is valid starting from DATE_FROM (inclusive) to DATE_TO (exclusive). The “current” rows have a DATE_TO of 9999-12-31.
  • CURRENT_DATA contains all the rows in ALL_DATA that have DATE_TO = 9999-12-31.
  • STAGING_DATA contains the data to be loaded. DATE_EFFECTIVE is the “effective date”.

To help me see what happens, I increment a sequence every time I load data into ALL_DATA. Updated rows are assigned the sequence number plus 0.1, and new rows the sequence number plus 0.2.

drop sequence load_seq;
create sequence load_seq;

drop table all_data purge;
create table all_data as
select 1 pk_id,
date '2014-11-13' date_from,
date '9999-12-31' date_to,
'A' rec_value,
load_seq.nextval+.2 load_id
from dual;

drop table current_data purge;
create table current_data as
select * from all_data;

drop table staging_data purge;
create table staging_data(pk_id, date_effective, rec_value)
as select
1, date '2014-12-01', 'A' from dual union all select
1, date '2014-12-02', 'B' from dual union all select
1, date '2014-12-03', 'B' from dual union all select
1, date '2014-12-04', 'C' from dual union all select
1, date '2014-12-05', 'A' from dual union all select
1, date '2014-12-06', 'A' from dual union all select
1, date '2014-12-07', 'D' from dual;
  • The first staged row (2014-12-01, 'A') has the same REC_VALUE as the existing row in ALL_DATA, so it should be combined with that row.
  • The rows for 2014-12-02 and 2014-12-03 (both 'B') should be combined into one row.
  • The rows for 2014-12-05 and 2014-12-06 (both 'A') should be combined into one row.

Finding the Data to Change

Dr. Friedrich used the method I call “Start of Group” to find out what data to change. The MATCH_RECOGNIZE clause does the same job more efficiently and with less code.

select * from (
  select pk_id, date_effective, rec_value
  from staging_data
  union all
  select pk_id, date_from, rec_value
  from all_data
  where date_to >= (select min(date_effective) from staging_data)
) match_recognize (
  partition by pk_id order by date_effective
  measures first(date_effective) date_from,
    nvl(next(date_effective), date '9999-12-31') date_to,
    rec_value rec_value
  pattern(a b*)
  define b as rec_value = prev(rec_value)
);
  • The UNION ALL branch reading ALL_DATA makes sure I include every existing row that could be impacted by the new data: any row whose DATE_TO is on or after the earliest effective date being loaded.
  • In this simple example, REC_VALUE is the only data that can change, so the DEFINE clause groups together all the consecutive rows that have the same value.
PK_ID DATE_FROM  DATE_TO    REC_VALUE
----- ---------- ---------- ---------
    1 2014-11-13 2014-12-02 A
    1 2014-12-02 2014-12-04 B
    1 2014-12-04 2014-12-05 C
    1 2014-12-05 2014-12-07 A
    1 2014-12-07 9999-12-31 D
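The consecutive-grouping that MATCH_RECOGNIZE performs here can be sketched in plain Python. This is an illustration of the logic only, not how the database executes it; the `collapse` function and its inputs are my own invention:

```python
from datetime import date

MAX_DATE = date(9999, 12, 31)  # stands in for DATE '9999-12-31'

def collapse(rows):
    """Collapse consecutive rows with the same value into date ranges.

    rows: list of (effective_date, value), already sorted by date
    (the ORDER BY in the MATCH_RECOGNIZE clause).
    Returns a list of (date_from, date_to, value); date_to is
    exclusive, with MAX_DATE marking the "current" row.
    """
    ranges = []
    for eff, value in rows:
        if ranges and ranges[-1][2] == value:
            continue  # same value as the open group: absorb it (pattern variable b)
        if ranges:
            # a new value starts: close the previous range at this date
            ranges[-1] = (ranges[-1][0], eff, ranges[-1][2])
        ranges.append((eff, MAX_DATE, value))  # open a new range (pattern variable a)
    return ranges

# The existing ALL_DATA row plus the staged rows, unioned and sorted by date
rows = [
    (date(2014, 11, 13), 'A'),
    (date(2014, 12, 1), 'A'),
    (date(2014, 12, 2), 'B'),
    (date(2014, 12, 3), 'B'),
    (date(2014, 12, 4), 'C'),
    (date(2014, 12, 5), 'A'),
    (date(2014, 12, 6), 'A'),
    (date(2014, 12, 7), 'D'),
]

for r in collapse(rows):
    print(r)  # five ranges, matching the query output above
```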


Merging the Changes

var load_id number;
exec :load_id := load_seq.nextval;

merge into all_data o
using (
  select * from (
    select pk_id, date_effective, rec_value
    from staging_data
    union all
    select pk_id, date_from, rec_value
    from all_data
    where date_to >= (select min(date_effective) from staging_data)
  ) match_recognize (
    partition by pk_id order by date_effective
    measures first(date_effective) date_from,
      nvl(next(date_effective), date '9999-12-31') date_to,
      rec_value rec_value
    pattern(a b*)
    define b as rec_value = prev(rec_value)
  )
) n
on ( (o.pk_id, o.date_from) = ((n.pk_id, n.date_from)) )
when matched then update set
  load_id = to_number(:load_id)+.1, date_to = n.date_to
  where o.date_to != n.date_to
when not matched then insert values(
  n.pk_id, n.date_from, n.date_to, n.rec_value, :load_id+.2
);

5 rows merged.
  • The WHERE clause of the UPDATE: my USING clause may return rows from ALL_DATA that don’t need to be changed, so I check DATE_TO to make sure I don’t do an update for nothing.
  • As you can see from the output below, I updated the existing row and inserted 4 new rows.
select * from all_data order by 3;
PK_ID DATE_FROM  DATE_TO    REC_VALUE LOAD_ID
----- ---------- ---------- --------- -------
    1 2014-11-13 2014-12-02 A             2.1
    1 2014-12-02 2014-12-04 B             2.2
    1 2014-12-04 2014-12-05 C             2.2
    1 2014-12-05 2014-12-07 A             2.2
    1 2014-12-07 9999-12-31 D             2.2


Refreshing CURRENT_DATA

merge into current_data o
using (
  select * from all_data
  where date_to = date '9999-12-31'
) n
on (o.pk_id = n.pk_id)
when matched then update set
  load_id = n.load_id, date_from = n.date_from,
  rec_value = n.rec_value
  where o.load_id != n.load_id
when not matched then insert values(
  n.pk_id, n.date_from, n.date_to, n.rec_value, n.load_id
);

1 row merged.

Avoiding Unnecessary Updates

If you run each MERGE statement a second time, you will see the lovely message “0 rows merged.” I always do this test. If you are changing data the second time, either there is a bug or you are updating a row to be the same as it was before, which is a lot of work for no benefit.
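The idempotence check can be sketched in Python. This is a toy model of the MERGE, not the real SQL: `plan_merge`, its dict representation, and the variable names are all my own assumptions:

```python
from datetime import date

MAX_DATE = date(9999, 12, 31)  # stands in for DATE '9999-12-31'

def plan_merge(existing, target):
    """Compute the changes needed to turn `existing` into `target`.

    Both are dicts mapping date_from -> (date_to, value), mirroring the
    MERGE: a matched key whose date_to differs is an update (the WHERE
    clause of the UPDATE skips identical rows); an unmatched key is an
    insert. Returns (updates, inserts) as lists of keys.
    """
    updates = [k for k in target
               if k in existing and existing[k][0] != target[k][0]]
    inserts = [k for k in target if k not in existing]
    return updates, inserts

# ALL_DATA before the load, and the ranges produced by MATCH_RECOGNIZE
existing = {date(2014, 11, 13): (MAX_DATE, 'A')}
target = {
    date(2014, 11, 13): (date(2014, 12, 2), 'A'),
    date(2014, 12, 2): (date(2014, 12, 4), 'B'),
    date(2014, 12, 4): (date(2014, 12, 5), 'C'),
    date(2014, 12, 5): (date(2014, 12, 7), 'A'),
    date(2014, 12, 7): (MAX_DATE, 'D'),
}

ups, ins = plan_merge(existing, target)
print(len(ups), len(ins))            # first run: 1 update, 4 inserts

# Apply the changes, then re-run: nothing left to change ("0 rows merged")
for k in ups + ins:
    existing[k] = target[k]
print(plan_merge(existing, target))  # second run: ([], [])
```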
