Join Tables on Date Ranges

A recent question on the OTN SQL forum asked how best to join two tables related by ID and date range, in order to insert one row per date into a data warehouse. One solution was to expand the data from each table, creating one row per date, then join on date. I think it’s more efficient to join on date range, then expand. Here’s how.

Input Tables and Requirement

I am going to simplify a bit from the OTN setup.

The first table contains an ID, a start date and an end date. (I recommend to always use exclusive end dates.) The final output should have one row for every date within the date range.

create table table_1 (
  id number primary key,
  start_date date not null check (start_date = trunc(start_date)),
  end_date date not null check (end_date = trunc(end_date)),
  check (end_date > start_date)
);
TABLE_1 Output
ID START_DATE END_DATE EACH_DATE
1 2015-03-01 2015-03-05 2015-03-01
2015-03-02
2015-03-03
2015-03-04

 

The second table contains the ID and a “review date” that assigns a “group ID”. In the output, each row should contain the data from the most recent review – or null if there was no prior review.

create table table_2 (
  id number,
  review_date date not null check (review_date = trunc(review_date)),
  group_id varchar2(20) not null,
  primary key (id, review_date)
);
TABLE_1 Output TABLE_2
ID START_DATE END_DATE EACH_DATE REVIEW_DATE GROUP_ID ID REVIEW_DATE GROUP_ID
1 2015-03-01 2015-03-05 2015-03-01 (null) (null)
2015-03-02 2015-03-02 First 1 2015-03-02 First
2015-03-03 2015-03-02 First
2015-03-04 2015-03-04 Second 1 2015-03-04 Second

 

Note: the most recent review is used even if that review took place before the start date in table 1.

The algorithm: join, then expand

If you look at the pretty table just above, you might think that I need to generate the green rows, then join each table to them. That means joining three sources. When I tell you that each ID can have a date range of several months, then you see that the extra work and memory required are not negligible.

So let’s try to join the two tables directly, based on intersecting date ranges:

  1. TABLE_1 already has a start date and an end date.
  2. TABLE_2 has a review date, which starts a range.
    1. I use the LEAD() analytic function to get the next review date, which is the end of the range.
    2. To identify the first row for each ID, I use the LAG() analytic function to get the previous review date (which will be NULL on the first row). We’ll see why.
  3. I join the tables:
    when the review date is earlier than the end date,
    and the next review date is either null or later than the start date.
    It’s a LEFT JOIN since I produce rows whether the ID is in TABLE_2 or not.
  4. Now I have to create a new date range that represents just the intersection.
    1. The beginning is the later of start date and review date – except if the previous review date is null.
      In that case this is the first row for an ID, so the beginning is the start date.
    2. The end is the earlier of end date and next review date.
  5. Finally I expand the rows
    1. When expanding the first row, I may find a start date that is earlier than the review date (see 4A above).
      For days earlier than the review date, I output a row with a null review date and a null group id.

Test data

Allen’s Interval Algebra defines 13 possible relations between date ranges.

Day of month: 1 2 3 4
1: precedes 1 2
3 4
2: meets 1 2
2 3
3: overlaps 1 3
2 4
4: finished by 1 3
2 3
5: contains 1 4
2 3
6: starts 1 2
1 3
7: equals 1 2
1 2
8: started by 1 3
1 2
9: during 2 3
1 4
10: finishes 2 3
1 3
11: overlapped by 2 4
1 3
12: met by 2 3
1 2
13: preceded by 3 4
1 2

In my test data, ID 0 has no review date. IDs 1 through 13 have one review date, and IDs 14 through 26 have two. The review dates correspond to the start and end dates for the green rows in the table.

alter session set nls_date_format='DD';

insert into table_1 select
0, to_date(1),to_date(2) from dual union all select
1, to_date(1),to_date(2) from dual union all select
2, to_date(1),to_date(2) from dual union all select
3, to_date(1),to_date(3) from dual union all select
4, to_date(1),to_date(3) from dual union all select
5, to_date(1),to_date(4) from dual union all select
6, to_date(1),to_date(2) from dual union all select
7, to_date(1),to_date(2) from dual union all select
8, to_date(1),to_date(3) from dual union all select
9, to_date(2),to_date(3) from dual union all select
10, to_date(2),to_date(3) from dual union all select
11, to_date(2),to_date(4) from dual union all select
12, to_date(2),to_date(3) from dual union all select
13, to_date(3),to_date(4) from dual;

insert into table_1
select id+13, start_date, end_date
from table_1
where id > 0;

insert into table_2 select
1, to_date(3),'precedes' from dual union all select
2, to_date(2),'meets' from dual union all select
3, to_date(2),'overlaps' from dual union all select
4, to_date(2),'finished by' from dual union all select
5, to_date(2),'contains' from dual union all select
6, to_date(1),'starts' from dual union all select
7, to_date(1),'equals' from dual union all select
8, to_date(1),'started by' from dual union all select
9, to_date(1),'during' from dual union all select
10, to_date(1),'finishes' from dual union all select
11, to_date(1),'overlapped by' from dual union all select
12, to_date(1),'met by' from dual union all select
13, to_date(1),'preceded by' from dual;

insert into table_2
select id+13, review_date, group_id from table_2;

insert into table_2
select id,
review_date +
case when group_id = 'during' then 3
  when group_id in ('overlaps','starts','finishes','overlapped by') then 2
  else 1
end,
'after ' || group_id
from table_2
where id > 13;
commit;

alter session set nls_date_format='yyyy-mm-dd';

The solution (finally!)

select id, column_value each_date, start_date, end_date, 
  case when column_value >= review_date then review_date end review_date, 
  case when column_value >= review_date then group_id end group_id
from (
  select a.id, start_date, end_date, review_date, group_id,
  case when prev_rd is null
    then start_date
    else greatest(start_date, review_date)
  end range_start,
  case when end_rd is null
    then end_date
    else least(end_date, end_rd)
  end range_end
  from table_1 a
  left join (
    select review_date, group_id, id,
    lead(review_date) over (partition by id order by review_date) end_rd,
    lag(review_date) over (partition by id order by review_date) prev_rd
    from table_2
  ) b
  on a.id = b.id
    and start_date < coalesce(end_rd, end_date)
    and end_date > review_date
) 
,
table(cast(multiset(
  select range_start-1+level from dual
  connect by range_start-1+level < range_end
) as sys.odcidatelist))
order by id, each_date;
  • Lines 16-19: get previous and next review dates from table_2
  • Lines 21-23: left join on ID, and end date greater than review date, and start date less than next review date. If the next review date is null, we are on the last review date so we go all the way to the end date.
  • Lines 6-13: calculate the date range for the joined row.
  • Lines 2-3: leave the review date and group id columns NULL if EACH_DATE is earlier than the first review date.
ID EACH_DATE START_DATE END_DATE REVIEW_DATE GROUP_ID
0 2015-03-01 2015-03-01 2015-03-02
1 2015-03-01 2015-03-01 2015-03-02
2 2015-03-01 2015-03-01 2015-03-02
3 2015-03-01 2015-03-01 2015-03-03
3 2015-03-02 2015-03-01 2015-03-03 2015-03-02 overlaps
4 2015-03-01 2015-03-01 2015-03-03
4 2015-03-02 2015-03-01 2015-03-03 2015-03-02 finished by
5 2015-03-01 2015-03-01 2015-03-04
5 2015-03-02 2015-03-01 2015-03-04 2015-03-02 contains
5 2015-03-03 2015-03-01 2015-03-04 2015-03-02 contains
6 2015-03-01 2015-03-01 2015-03-02 2015-03-01 starts
7 2015-03-01 2015-03-01 2015-03-02 2015-03-01 equals
8 2015-03-01 2015-03-01 2015-03-03 2015-03-01 started by
8 2015-03-02 2015-03-01 2015-03-03 2015-03-01 started by
9 2015-03-02 2015-03-02 2015-03-03 2015-03-01 during
10 2015-03-02 2015-03-02 2015-03-03 2015-03-01 finishes
11 2015-03-02 2015-03-02 2015-03-04 2015-03-01 overlapped by
11 2015-03-03 2015-03-02 2015-03-04 2015-03-01 overlapped by
12 2015-03-02 2015-03-02 2015-03-03 2015-03-01 met by
13 2015-03-03 2015-03-03 2015-03-04 2015-03-01 preceded by
14 2015-03-01 2015-03-01 2015-03-02
15 2015-03-01 2015-03-01 2015-03-02
16 2015-03-01 2015-03-01 2015-03-03
16 2015-03-02 2015-03-01 2015-03-03 2015-03-02 overlaps
17 2015-03-01 2015-03-01 2015-03-03
17 2015-03-02 2015-03-01 2015-03-03 2015-03-02 finished by
18 2015-03-01 2015-03-01 2015-03-04
18 2015-03-02 2015-03-01 2015-03-04 2015-03-02 contains
18 2015-03-03 2015-03-01 2015-03-04 2015-03-03 after contains
19 2015-03-01 2015-03-01 2015-03-02 2015-03-01 starts
20 2015-03-01 2015-03-01 2015-03-02 2015-03-01 equals
21 2015-03-01 2015-03-01 2015-03-03 2015-03-01 started by
21 2015-03-02 2015-03-01 2015-03-03 2015-03-02 after started by
22 2015-03-02 2015-03-02 2015-03-03 2015-03-01 during
23 2015-03-02 2015-03-02 2015-03-03 2015-03-01 finishes
24 2015-03-02 2015-03-02 2015-03-04 2015-03-01 overlapped by
24 2015-03-03 2015-03-02 2015-03-04 2015-03-03 after overlapped by
25 2015-03-02 2015-03-02 2015-03-03 2015-03-02 after met by
26 2015-03-03 2015-03-03 2015-03-04 2015-03-02 after preceded by

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s