SQL and date ranges: don’t make NULL mean something

We often use NULL values in date ranges, usually in the end date to signify the indefinite future. I have decided that this is a bad idea; here are my reasons.

[Update 2014-07-04] There have been some great replies to this post. I am going to address some of their points.

1) NULL should not mean anything

According to the SQL standard, NULL means “unknown”. That is why you can’t really compare a NULL value to anything, you can only say if it IS NULL or not.

If you use NULL in date ranges, it doesn’t just mean something, it means two different things! In the start date it means “from the beginning of time”, and in the end date it means “‘until the end of time”. This in not only contrary to the “official” meaning of NULL, it is confusing.

[Update] Kevan Gelling points out that there may really be “unknown” values, in which case NULL should be allowed. I agree, as long as NULL is reserved for that use.

Jeffrey Kemp says: ‘In most cases an End Date of NULL means “we don’t know (yet) what the end date will be – or even if it will ever end”.’ Jeffrey, you should add “we don’t even know if it has already ended”! Your phrase taken alone implies that the end date is in the future, not in the past. This makes NULL meaningful.

2) We can use real date limits instead of NULL

One argument for using NULL is that it means the earliest possible start date, or the latest possible end date. We don’t need NULL for that! In Oracle, the earliest possible date is 4712-01-01 B.C., or DATE '-4712-01-01'. The latest possible date is DATE '9999-12-31' (you can add the time element 23:59:59 if you want to be a purist.)

To enforce this, I suggest declaring the start and end date columns as NOT NULL with default values. In Database 12c, I would use the DEFAULT ON NULL clause: this clause puts the default value in the column even if you explicitly try to put a NULL there.

If you want the output to show NULL instead of these default values, you do have to use something like NULLIF() in your SELECT clause.

[Update] Again, Kevan Gelling argues that NULL may be necessary to indicate a value that is not known. In that case my suggestion is no good, but you still have to use something other than NULL to indicate “beginning of time” and “end of time”.

Several posters bring up the fact that artificial extreme dates will “skew” the data and make it hard for the optimizer to choose the best plan. They are right, I should mention this. However, the data is skewed whether I use extreme dates or NULL. Supposing I do use NULL to mean “the end of time”, many queries will have to use END_DATE IS NULL in the WHERE clause, so the skew is there no matter what.

3) Oracle doesn’t index NULLs

When you query data with date ranges, you often have to check “greater than” one value and “less than” another value. This may require separate indexes on start date and end date. If you have NULLs in your date ranges, those indexes will not be used since Oracle doesn’t create index entries when all the indexed columns are NULL.

If you use real values and NOT NULL, your indexes will always work.

[Update] Some readers apparently didn’t see the phrase that I have now underlined. The header was probably misleading: Oracle will indeed index a row if any of the indexed columns is not NULL.

Some objected that indexes on dates alone are rarely used: date ranges are almost always applied to some object, so the index will include the object and one or both dates. In that case, the rows with NULL dates will be indexed. I agree. In that situation, NULLs don’t cause a problem, and “skew” won’t either, as long as the object comes before the date in the index column list.

4) Queries on NULLable date ranges are hard

I have blogged about various date range problems: finding gaps, finding overlaps and merging contiguous ranges. Almost always, my solutions worked fine without NULLs and broke when I introduced NULLs. Handling NULLs required either more complex queries or substituting real values for NULL. Why go to all that work when we can just use the real values to begin with?

[Update] Even if we need NULL values sometimes to mean “unknown value”, we still don’t use IS NULL in our queries, so indexes should work when needed.

5) Unique constraints don’t work with NULL

Most queries on date ranges, including mine, assume that start dates are unique. In some cases, it may also be useful to have unique end dates. The only way to make sure they are unique is to add a unique constraint. Unfortunately, uniqueness is only enforced for NOT NULL values.

Most of the time, your constraint will be on two columns: some object and a date. In this case the unique constraint will work as long as the object column is NOT NULL. However, anytime you need unique dates throughout a table you must define them as NOT NULL and use default values.

[Update] As mentioned under paragraph 3), this point is only valid for unique constraints on the date alone. This is probably a rare case. Also, as Kevan reminded me, you could always use a function-based index on (DATE_COL, ‘X’) to make sure every row was indexed.

[Update] Conclusion: NULL should mean “unknown” only

As readers have pointed out, my arguments 3) and 5) are weak, since they only apply to indexes or constraints on a date column alone, and those are rarely needed.

I’ll stand by arguments 1), 2) and 4), but thanks to Kevan I’ll add one thing: just because I don’t use NULL to mean “beginning or end of time”, that doesn’t mean I might not need it to mean “unknown value”. In that case, default values may not be the way to go.

Finally, please read the replies: good stuff, and thanks to all!

About these ads

10 thoughts on “SQL and date ranges: don’t make NULL mean something

  1. it would be nice if there were constants for indefinite past and indefinite future like there are for numeric datatypes ( BINARY_DOUBLE_INFINITY and so on ).
    Unfortunately, NUMTOYMINTERVAL(BINARY_DOUBLE_INFINITY,’year’) raises
    ORA-01426: numeric overflow

  2. Stew,
    I also don’t like NULL in date ranges and agree with your conclusions – but for the sake of completeness I think it’s worth mentioning that a default end date 9999-31-12 may confuse the optimizer – at least if there are no suitable histograms (forcing the optimizer to calculate with an even distribution between low_value and high_value).

    Regards
    Martin

    • Martin, you are right. I’ll complete my post. It all depends on whether we want to index the date alone or some other column plus the date.

  3. If you are using Oracle 11gr2 or later, just make sure you turn off optimizer bind variable peeking (“_optim_peek_user_binds”=FALSE) or else your comparisons to high_date will ignore any calculated stats or histograms and go South in a big way.
    Yeah, it’s turned on by default and will KILL any query using high_values or low_value. Silly, but apparently it was thought out by “very smart” people so all of us “unwashed” have to accept it…
    And NULL values *can* be indexed. Look it up in Tom’s blog.

    • Nuno,

      Either I don’t understand all you say, or I disagree with some of it.

      Bind variable peeking started earlier than 11gR2. It allows the optimizer to use histograms more, not less; so the plan will be more optimized for the first execution — and maybe less for the next.

      Your basic premise is that using default values introduces “skew” in the data, which means that the same query doesn’t always want the same plan. That is true, but the skew exists with NULL too, it’s just treated differently in the statistics.

      I did not say that NULL values cannot be indexed: I said “Oracle doesn’t create index entries when all the indexed columns are NULL.” Tom’s blog says exactly that. It is true there are workarounds with function-based indexed to index every row, at the cost of complicating the queries.

  4. 1) NULL should not mean anything.
    If the end date represents the end of the active period of the record and this is not yet known then the end date is “unknown”

    2) We can use real date limits instead of NULL
    A pseudo hi-value could be misleading. If the end date is not yet unknown then a future value may imply that the record is active at a point of time in the future when it isn’t.

    3) Oracle doesn’t index NULLs
    Oracle will index NULL it the column isn’t the last column on the index column list,
    e.g. CREATE INDEX my_index ON my_table ( nullable_column, ‘x’ ) ;

    4) Queries on NULLable date ranges are hard
    Maybe, but the optimiser recognises NVL() and may rewrite the query to use underlying indexes. COALESCE() is not recognised though.

    5) Unique constraints don’t work with NULL
    Unique constraint do work if the ‘index NULLs’ trick from 3) is used

    • Kevan,

      1) Then we agree, NULL should keep its standard meaning.

      2) I see your point. If the date is question can really be “unknown”, then we should allow NULL to signify that.

      3) I said “Oracle doesn’t create index entries when all the indexed columns are NULL.”
      You only need one non-null value in the index column list; the non-null value doesn’t have to be at the end.

      4) I didn’t know that the optimizer recognizes NVL and not COALESCE. Anyway, the real problem is that functions like MAX() will ignore NULLs if it can, so the query has to be rewritten in a more complex manner before it can even be submitted to the optimizer.

      5) Agreed.

  5. I’ve seen both methods used widely; certainly the code becomes cleaner and simpler when the columns are declare NOT NULL.

    I don’t entirely agree with all your points – but also I don’t disagree with them all either :)

    1) “NULL should not mean anything” – I agree that NULL normally should mean “unknown”. In most cases an End Date of NULL means “we don’t know (yet) what the end date will be – or even if it will ever end”. A more purist solution would involve separate relations for rows that have no end date.

    2) “Real date limits” – this point is one where I disagree. What you have there are what are called (disparagingly) “magic” values – and magic values can themselves introduce problems. The problem with magic values is that they are indistinguishable from real values. In this case, of course, it’s extremely unlikely that the years -4712 or +9999 will ever be needed in any real application. In practice, however, they cause problems with SQL optimisation, because they are so far outside the range of values for the column. Histograms can go someway towards reducing the impact of this problem, but in many cases they still cause issues.

    3) “Oracle doesn’t index NULLs” – this is not quite true. As long as one column has a value, the row will appear in the index. Generally a table with an index on Start/End dates will benefit from having one or more discriminator columns in the index anyway.

    4) “Queries are hard” – this is the primary benefit of the approach you’re advocating, and it’s a good one. Queries are simpler, code is simpler, less chances for bugs etc. Generally when I have a NULLable column, I have to write the code assuming NULLs from the start – never deal with NULLs as an afterthought.

    5) “Unique constraints don’t work” – I disagree with this one. It would be a rare table where the unique constraint is *only* on the Start/End dates – as per my point (3) above, there will be at least another discriminator column. A unique constraint over (Emp ID, Start Date, End Date) will ensure that there are not two records with the same Emp ID and Start Date even if End date is null. Of course, we need more than just a unique constraint to stop overlapping date ranges, as you know.

    Finally, I wouldn’t necessarily dictate one method or the other – the most important advice I’d give anyone is to simply be consistent within a schema and don’t mix the two techniques.

    Thanks again for a good article.

  6. Stew,
    just another detail: there is a documented problem with a default 9999-12-31 in combination with interval partitioning (at least in 11.2): “Note, however, that using a date where the high or low bound of the partition would be out of the range set for storage causes an error. For example, TO_DATE(’9999-12-01′, ‘YYYY-MM-DD’) causes the high bound to be 10000-01-01, which would not be storable if 10000 is out of the legal range.” http://docs.oracle.com/cd/E18283_01/server.112/e16541/part_admin001.htm#insertedID2 – and an insert or even a select with a condition using 9999-12-31 results in “ORA-01841: (full) year must be between -4713 and +9999, and not be 0″. So maybe another default value would be more appropriate.

    • Martin,

      I just saw your comment. The same warning appears in the 12c documentation. Very interesting…

      I did some tests using the example from the documentation. If I try to insert a row with a NULL date I get the error:
      “ORA-14300: partitioning key maps to a partition outside maximum permitted number of partitions”

      The “last” possible partition on date should end in 10000-01-01, which as you say is illegal. This basically means that whenever we define an interval partition on date, then the last interval becomes illegal. If my interval is 1000 years, then anything from the year 9000 on is illegal.

      I’m not sure why anyone would want to partition on the end date of a date range, since that value is the one that tends to change and rows would change partitions all the time. If that is a requirement, then yes the default value must avoid that last “illegal” partition.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s