Joining Temporal Tables 1: virtual date ranges

In my presentation at UKOUG Tech 15, I explained that temporal tables with date ranges are very hard to manage. One way to simplify things is not to store date ranges at all! In this post, I’ll explain how to join two temporal tables when there are no gaps, no overlaps and no nulls in any column.

[Update 2016-01-14: I forgot to mention that this series of posts was inspired by a discussion with Richard Smith that started at UKOUG Tech 15. Thanks Richard!]

[Update 2016-01-15: replaced FULL JOIN USING() with UNION ALL for performance reasons, see below.]

Reminder: Allen’s time intervals

Dr. Allen’s Time Interval Algebra refers to “time intervals”. The SQL:2011 standard refers to “time periods”. I talk about “date ranges” simply because it is a more popular search term – forgive me! Anyway, for my purposes they all mean the same thing.

Here is a summary of the possible relations between two time intervals:

Time Interval Relationships

Although only the day of the month is shown, the boundaries are actually points in time, at exactly 12 A.M. each day. Notice that the start time is included in the interval, but the end time is not. This allows two intervals to “meet” at 12 A.M. without overlapping.

Temporal tables without gaps or overlaps

One of the hardest things about temporal tables is establishing constraints. Suppose you have a table with date ranges, FROM_DATE and TO_DATE:

  • You can say that FROM_DATE is NOT NULL
  • You can say that TO_DATE must be greater than FROM_DATE
  • But: how do you say that overlapping date ranges are not allowed? How do you say that gaps are not allowed?
    These are examples of “cross row” constraints that are very hard to implement successfully.

The simplest way to implement such constraints is to store FROM_DATE only, and calculate TO_DATE! That is what I mean by “virtual date ranges”.

Let’s take a very simple example to illustrate. Note that none of the columns can be NULL, that’s important.

drop table a purge;

create table a (
  case_id varchar2(32),
  from_date date,
  a_string varchar2(32) not null,
  primary key(case_id, from_date)
);

insert into a
select 'Meets only, no nulls' case_id,
  trunc(sysdate, 'MM') - 1 + level*level from_date,
  'Row ' || level a_string
from dual
connect by level <= 3;

select case_id, from_date,
lead(from_date) over (partition by case_id order by from_date) to_date,
a_string
from a;
CASE_ID FROM_DATE TO_DATE A_STRING
Meets only, no nulls 2016-01-01 2016-01-04 Row 1
Meets only, no nulls 2016-01-04 2016-01-09 Row 2
Meets only, no nulls 2016-01-09 Row 3

 

When two date ranges “meet”, the TO_DATE is always equal to the following FROM_DATE. So why store the TO_DATE? Just use the LEAD() function to get the TO_DATE from the following row. It’s so simple.

Here is another table with the same constraints as the first:

drop table b purge;

create table b (
  case_id varchar2(32),
  from_date date,
  b_string varchar2(32) not null,
  primary key(case_id, from_date)
);

insert into b
select 'Meets only, no nulls' case_id,
  trunc(sysdate, 'MM') - 1 + level*2 from_date,
  'Row ' || level b_string
from dual
connect by level <= 3;

select case_id, from_date,
lead(from_date) over (partition by case_id order by from_date) to_date,
b_string
from b;
CASE_ID FROM_DATE TO_DATE B_STRING
Meets only, no nulls 2016-01-02 2016-01-04 Row 1
Meets only, no nulls 2016-01-04 2016-01-06 Row 2
Meets only, no nulls 2016-01-06 Row 3

 

Joining gapless temporal tables with Analytics

[Update 2016-01-15: here is where I switch to UNION ALL.]

Before doing any analytic functions, I have to put the tables together somehow. I see two ways:

  • FULL JOIN on CASE_ID and FROM_DATE
  • UNION ALL

After testing, it seems that UNION ALL uses about half the CPU of the FULL JOIN.

select case_id, from_date, a_string, null b_string from a
union all
select case_id, from_date, null, b_string from b
order by 1,2;
CASE_ID FROM_DATE A_STRING B_STRING
Meets only, no nulls 2016-01-01 Row 1
Meets only, no nulls 2016-01-02 Row 1
Meets only, no nulls 2016-01-04 Row 2
Meets only, no nulls 2016-01-04 Row 2
Meets only, no nulls 2016-01-06 Row 3
Meets only, no nulls 2016-01-09 Row 3

 

This is not far from the answer we want. We need to “carry down” A_STRING and B_STRING values to the null columns in the rows below. Fortunately there is an analytic function that does just this: LAST_VALUE() with the IGNORE NULLS option. Once we do that, we just need to get rid of an extra row.

select * from (
  select case_id, from_date,
    lead(from_date) over(partition by case_id order by from_date) to_date,
    last_value(a_string)
      ignore nulls
      over(partition by case_id order by from_date)
    a_string,
    last_value(b_string)
      ignore nulls
      over(partition by case_id order by from_date)
    b_string
  from (
    select case_id, from_date, a_string, null b_string from a
    union all
    select case_id, from_date, null, b_string from b
  )
)
where to_date > from_date or to_date is null
order by 1,2;
CASE_ID FROM_DATE TO_DATE A_STRING B_STRING
Meets only, no nulls 2016-01-01 2016-01-02 Row 1
Meets only, no nulls 2016-01-02 2016-01-04 Row 1 Row 1
Meets only, no nulls 2016-01-04 2016-01-06 Row 2 Row 2
Meets only, no nulls 2016-01-06 2016-01-09 Row 2 Row 3
Meets only, no nulls 2016-01-09 Row 3 Row 3

 

Joining gapless temporal tables with MATCH_RECOGNIZE

The MATCH_RECOGNIZE clause lets us assign labels to the A and B rows so we can put columns from two input rows into one output row.

select case_id, from_date, to_date, astr a_string, bstr b_string
from (
  select 'A' rowtype, case_id, from_date, a_string, null b_string from a
  union all
  select 'B', case_id, from_date, null, b_string from b
)
match_recognize(
  partition by case_id order by from_date
  measures next(from_date) to_date, a.a_string astr, b.b_string bstr
  all rows per match
  pattern ( (a|b)+ )
  define a as rowtype = 'A'
)
where to_date > from_date or to_date is null;
  • In the MEASURES clause, A.<column> always refers to the most recent A row and B.<column> always refers to the most recent B row.
  • The PATTERN looks for exactly one match, with as many consecutive rows as possible, and where every row is either A or B.
  • Since B is not defined explicitly, it defaults to “always true”. It’s like an invisible “else”.
  • Since I say ALL ROWS PER MATCH, all the input columns show up in the output of MATCH_RECOGNIZE. That is why I have to rename A_STRING as ASTR: otherwise I would get an “ambiguous column” exception.

The output is exactly the same as from the analytic solution.

Limitations

There are two big limitations to these solutions:

  1. The values A_STRING and B_STRING cannot be NULL, or else the IGNORE NULLS option will not work right;
  2. There is no way to say that a CASE_ID does not exist anymore! It always exists from the latest FROM_DATE until the end of time.

In the following posts, I’m going to overcome these limitations, but be warned: the solutions are going to get more complex.

Advertisements

One thought on “Joining Temporal Tables 1: virtual date ranges

  1. Date ranges and operations on and between them are my favourite topic too. Spot on. Looking forward to your next post. I’m used to store the end date of a valid time period with “inclusive” semantics ( explicitly stating the last day the period has validity ), but I noticed (and I checked!) that your examples can be quite easily modified to my needs.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s