When you load data with an “effective date” into a Data Warehouse, what happens when the new data is the same as yesterday’s? Do you add a new record with the new “effective date” or do you update a date range in the existing record?
At Open World last month, I got both answers: Tim Gorman presented an efficient method for loading “effective dates” and Dr. Holger Friedrich presented a way to maintain date ranges.
I mentioned to Dr. Friedrich that the 12c MATCH_RECOGNIZE clause could maintain those date ranges more efficiently. By the time I actually wrote the SQL, Dr. Friedrich had written practically the same thing. Here is my variant, inspired by Dr. Friedrich’s test case.
The Test Data
I have three tables:
- ALL_DATA is the Data Warehouse. Each row is valid starting from DATE_FROM (inclusive) to DATE_TO (exclusive). The “current” rows have a DATE_TO of 9999-12-31.
- CURRENT_DATA contains all the rows in ALL_DATA that have DATE_TO = 9999-12-31.
- STAGING_DATA contains the data to be loaded. DATE_EFFECTIVE is the “effective date”.
To help me see what happens, I increment a sequence every time I load data into ALL_DATA. Updated rows are assigned the sequence number + 0.1, and new rows are assigned the sequence number + 0.2.
drop sequence load_seq;
create sequence load_seq;

drop table all_data purge;
create table all_data as
select 1 pk_id,
       date '2014-11-13' date_from,
       date '9999-12-31' date_to,
       'A' rec_value,
       load_seq.nextval+.2 load_id
from dual;

drop table current_data purge;
create table current_data as
select * from all_data;

drop table staging_data purge;
create table staging_data(pk_id, date_effective, rec_value) as
select 1, date '2014-12-01', 'A' from dual union all
select 1, date '2014-12-02', 'B' from dual union all
select 1, date '2014-12-03', 'B' from dual union all
select 1, date '2014-12-04', 'C' from dual union all
select 1, date '2014-12-05', 'A' from dual union all
select 1, date '2014-12-06', 'A' from dual union all
select 1, date '2014-12-07', 'D' from dual;
- The first staging row (2014-12-01, value 'A') should be combined with the existing row in ALL_DATA.
- The rows for 2014-12-02 and 2014-12-03 (both 'B') should be combined into one row.
- The rows for 2014-12-05 and 2014-12-06 (both 'A') should be combined into one row.
Finding the Data to Change
Dr. Friedrich used the method I call “Start of Group” to find out what data to change. MATCH_RECOGNIZE does the same job more efficiently and with less code.
select * from (
  select pk_id, date_effective, rec_value from staging_data
  union all
  select pk_id, date_from, rec_value from all_data
  where date_to >= (select min(date_effective) from staging_data)
)
match_recognize (
  partition by pk_id order by date_effective
  measures first(date_effective) date_from,
           nvl(next(date_effective), date '9999-12-31') date_to,
           rec_value rec_value
  pattern(a b*)
  define b as rec_value = prev(rec_value)
);
- The UNION ALL subquery makes sure to include all the rows in ALL_DATA that could be impacted by the new data.
- The PATTERN and DEFINE clauses: in this simple example, REC_VALUE is the only data that can change, so I group together all the consecutive rows that have the same value.
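To make the grouping concrete, here is a Python sketch (my illustration, not anything Oracle runs internally) of what PATTERN(a b*) with DEFINE b AS rec_value = prev(rec_value) computes over the test data: consecutive rows with the same value collapse into one match, and each match becomes a date range running up to the start of the next match.

```python
from itertools import groupby

# Staging rows plus the one impacted row from ALL_DATA (the UNION ALL above).
rows = [
    ("2014-11-13", "A"),  # existing ALL_DATA row
    ("2014-12-01", "A"),
    ("2014-12-02", "B"),
    ("2014-12-03", "B"),
    ("2014-12-04", "C"),
    ("2014-12-05", "A"),
    ("2014-12-06", "A"),
    ("2014-12-07", "D"),
]
rows.sort()  # ORDER BY date_effective (ISO strings sort chronologically)

# Each run of consecutive equal values is one match of PATTERN(a b*).
groups = [(val, [d for d, _ in grp])
          for val, grp in groupby(rows, key=lambda r: r[1])]

# MEASURES: first(date_effective), nvl(next(date_effective), '9999-12-31')
ranges = []
for i, (val, dates) in enumerate(groups):
    date_from = dates[0]
    date_to = groups[i + 1][1][0] if i + 1 < len(groups) else "9999-12-31"
    ranges.append((date_from, date_to, val))

for r in ranges:
    print(r)
```

This prints five ranges, matching the five rows the SQL query returns: the two leading 'A' rows fold into one range starting 2014-11-13, and the final 'D' row gets DATE_TO 9999-12-31.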
Merging the Changes
var load_id number;
exec :load_id := load_seq.nextval;

merge into all_data o
using (
  select * from (
    select pk_id, date_effective, rec_value from staging_data
    union all
    select pk_id, date_from, rec_value from all_data
    where date_to >= (select min(date_effective) from staging_data)
  )
  match_recognize (
    partition by pk_id order by date_effective
    measures first(date_effective) date_from,
             nvl(next(date_effective), date '9999-12-31') date_to,
             rec_value rec_value
    pattern(a b*)
    define b as rec_value = prev(rec_value)
  )
) n
on ( (o.pk_id, o.date_from) = ((n.pk_id, n.date_from)) )
when matched then update set
  load_id = to_number(:load_id)+.1,
  date_to = n.date_to
  where o.date_to != n.date_to
when not matched then insert values(
  n.pk_id, n.date_from, n.date_to, n.rec_value, :load_id+.2
);

5 rows merged.
- The WHERE clause on the UPDATE: in my USING clause, I may get rows from ALL_DATA that don’t need to be changed, so I check DATE_TO to make sure I don’t do an update for nothing.
- As the “5 rows merged” feedback shows, I updated the existing row and inserted 4 new rows.
select * from all_data order by 3;
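The MERGE decision can be sketched in Python (my simplification: string dates, PK_ID omitted since it is always 1 here). New ranges are matched to ALL_DATA on DATE_FROM; a matched row is updated only when its DATE_TO changed, and unmatched ranges are inserted.

```python
# The five ranges produced by the MATCH_RECOGNIZE query above.
new_ranges = [
    ("2014-11-13", "2014-12-02", "A"),
    ("2014-12-02", "2014-12-04", "B"),
    ("2014-12-04", "2014-12-05", "C"),
    ("2014-12-05", "2014-12-07", "A"),
    ("2014-12-07", "9999-12-31", "D"),
]
# ALL_DATA before the merge: {date_from: (date_to, rec_value)}
all_data = {"2014-11-13": ("9999-12-31", "A")}

updates, inserts = 0, 0
for date_from, date_to, value in new_ranges:
    if date_from in all_data:                    # WHEN MATCHED
        if all_data[date_from][0] != date_to:    # ... WHERE o.date_to != n.date_to
            all_data[date_from] = (date_to, value)
            updates += 1
    else:                                        # WHEN NOT MATCHED
        all_data[date_from] = (date_to, value)
        inserts += 1

print(updates, inserts)  # 1 update + 4 inserts = "5 rows merged"
```

Only the pre-existing 2014-11-13 row is matched, and its DATE_TO shrinks from 9999-12-31 to 2014-12-02; the other four ranges are new rows.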
Finally, I refresh CURRENT_DATA from the rows in ALL_DATA that are still current:

merge into current_data o
using (
  select * from all_data
  where date_to = date '9999-12-31'
) n
on (o.pk_id = n.pk_id)
when matched then update set
  load_id = n.load_id,
  date_from = n.date_from,
  rec_value = n.rec_value
  where o.load_id != n.load_id
when not matched then insert values(
  n.pk_id, n.date_from, n.date_to, n.rec_value, n.load_id
);

1 rows merged.
Avoiding Unnecessary Updates
If you run each MERGE statement a second time, you will see the lovely message “0 rows merged.” I always do this test. If you are changing data the second time, either there is a bug or you are updating a row to be the same as it was before, which is a lot of work for no benefit.
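The idempotence check can be simulated end to end with the Python sketch below (my own helper `merge_load`, string dates, PK_ID omitted): the second load finds that every computed range already exists with the same DATE_TO, so it merges nothing.

```python
from itertools import groupby

staging = [
    ("2014-12-01", "A"), ("2014-12-02", "B"), ("2014-12-03", "B"),
    ("2014-12-04", "C"), ("2014-12-05", "A"), ("2014-12-06", "A"),
    ("2014-12-07", "D"),
]
# ALL_DATA as {date_from: (date_to, rec_value)}
all_data = {"2014-11-13": ("9999-12-31", "A")}

def merge_load(all_data, staging):
    """One load: collapse consecutive equal values into ranges, then merge."""
    min_eff = min(d for d, _ in staging)
    # Staging rows UNION ALL the impacted ALL_DATA rows.
    source = staging + [(d_from, val)
                        for d_from, (d_to, val) in all_data.items()
                        if d_to >= min_eff]
    source.sort()
    groups = [(val, [d for d, _ in grp])
              for val, grp in groupby(source, key=lambda r: r[1])]
    changes = 0
    for i, (val, dates) in enumerate(groups):
        d_from = dates[0]
        d_to = groups[i + 1][1][0] if i + 1 < len(groups) else "9999-12-31"
        # Update only when DATE_TO differs; insert when the key is new.
        if d_from not in all_data or all_data[d_from][0] != d_to:
            all_data[d_from] = (d_to, val)
            changes += 1
    return changes

first = merge_load(all_data, staging)
second = merge_load(all_data, staging)
print(first, second)  # 5 rows merged, then 0 rows merged
```

The first call makes 5 changes, matching the “5 rows merged” above; the second call makes 0, because duplicate (date, value) pairs from staging and ALL_DATA collapse into the same groups and every range already has the right DATE_TO.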