Going to UKOUG Tech14

I’m on my way to UKOUG Tech14, where I will present “Database 12c MATCH_RECOGNIZE: Beating the Best Pre-12c Solutions” on Tuesday at 3 P.M. in room 4B. With all the OakTable ACE Directors presenting at the same time, I don’t expect a big crowd, but I’ll do my best for those who do turn up.

This is my first user group conference, so I hope to see some of the denizens of the UGosphere. I very briefly met some luminaries at Open World, and they were very nice folks indeed.

Now where is that France to UK adapter?

Date Ranges in Data Warehouses using Oracle 12c

When you load data with an “effective date” into a Data Warehouse, what happens when the new data is the same as yesterday’s? Do you add a new record with the new “effective date” or do you update a date range in the existing record?

At Open World last month, I got both answers: Tim Gorman presented an efficient method for loading “effective dates” and Dr. Holger Friedrich presented a way to maintain date ranges.

I mentioned to Dr. Friedrich that the 12c MATCH_RECOGNIZE clause could maintain those date ranges more efficiently. By the time I actually wrote the SQL, Dr. Friedrich had written practically the same thing. Here is my variant, inspired by Dr. Friedrich’s test case.

The Test Data

I have three tables:

  • ALL_DATA is the Data Warehouse. Each row is valid starting from DATE_FROM (inclusive) to DATE_TO (exclusive). The “current” rows have a DATE_TO of 9999-12-31.
  • CURRENT_DATA contains all the rows in ALL_DATA that have DATE_TO = 9999-12-31.
  • STAGING_DATA contains the data to be loaded. DATE_EFFECTIVE is the “effective date”.

To help me see what happens, I increment a sequence every time I load data into ALL_DATA. Updated rows are assigned the sequence number plus .1, and new rows are assigned the sequence number plus .2.

drop sequence load_seq;
create sequence load_seq;

drop table all_data purge;
create table all_data as
select 1 pk_id,
date '2014-11-13' date_from,
date '9999-12-31' date_to,
'A' rec_value,
load_seq.nextval+.2 load_id
from dual;

drop table current_data purge;
create table current_data as
select * from all_data;

drop table staging_data purge;
create table staging_data(pk_id, date_effective, rec_value)
as select
1, date '2014-12-01', 'A' from dual union all select
1, date '2014-12-02', 'B' from dual union all select
1, date '2014-12-03', 'B' from dual union all select
1, date '2014-12-04', 'C' from dual union all select
1, date '2014-12-05', 'A' from dual union all select
1, date '2014-12-06', 'A' from dual union all select
1, date '2014-12-07', 'D' from dual;
  • The first staging row (2014-12-01) should be combined with the existing row in ALL_DATA, since both have REC_VALUE 'A'.
  • The rows for 2014-12-02 and 2014-12-03 (both 'B') should be combined into one row.
  • The rows for 2014-12-05 and 2014-12-06 (both 'A') should be combined into one row.

Finding the Data to Change

Dr. Friedrich used the method I call “Start of Group” to find out what data to change. MATCH_RECOGNIZE can replace this method with less code and greater efficiency.
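
For comparison, here is a rough sketch of that pre-12c approach applied to the same tables. This is my own illustration, not Dr. Friedrich’s actual code: flag each row where REC_VALUE changes, turn the running total of flags into a group number, then take one row per group.

select pk_id, date_from,
       nvl(lead(date_from) over(partition by pk_id order by date_from),
           date '9999-12-31') date_to,
       rec_value
from (
  -- one row per group of consecutive identical REC_VALUEs
  select pk_id, grp, rec_value, min(date_effective) date_from
  from (
    -- running total of the flags = group number
    select f.*,
           sum(start_flag) over(partition by pk_id order by date_effective) grp
    from (
      -- flag each row that starts a new group
      select u.*,
             case when rec_value = lag(rec_value)
                         over(partition by pk_id order by date_effective)
                  then 0 else 1 end start_flag
      from (
        select pk_id, date_effective, rec_value from staging_data
        union all
        select pk_id, date_from, rec_value from all_data
        where date_to >= (select min(date_effective) from staging_data)
      ) u
    ) f
  )
  group by pk_id, grp, rec_value
);

Here is the MATCH_RECOGNIZE version: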

select * from (
  select pk_id, date_effective, rec_value
  from staging_data
  union all
  select pk_id, date_from, rec_value
  from all_data
  where date_to >= (select min(date_effective) from staging_data)
) match_recognize (
  partition by pk_id order by date_effective
  measures first(date_effective) date_from,
    nvl(next(date_effective), date '9999-12-31') date_to,
    rec_value rec_value
  pattern(a b*)
  define b as rec_value = prev(rec_value)
);
  • The branch that selects from ALL_DATA makes sure I include all the existing rows that could be impacted by the new data.
  • In this simple example, REC_VALUE is the only data that can change, so the PATTERN and DEFINE clauses group together all the consecutive rows that have the same value.
PK_ID DATE_FROM  DATE_TO    REC_VALUE
----- ---------- ---------- ---------
    1 2014-11-13 2014-12-02 A
    1 2014-12-02 2014-12-04 B
    1 2014-12-04 2014-12-05 C
    1 2014-12-05 2014-12-07 A
    1 2014-12-07 9999-12-31 D

 

Merging the Changes

var load_id number;
exec :load_id := load_seq.nextval;

merge into all_data o
using (
  select * from (
    select pk_id, date_effective, rec_value
    from staging_data
    union all
    select pk_id, date_from, rec_value
    from all_data
    where date_to >= (select min(date_effective) from staging_data)
  ) match_recognize (
    partition by pk_id order by date_effective
    measures first(date_effective) date_from,
      nvl(next(date_effective), date '9999-12-31') date_to,
      rec_value rec_value
    pattern(a b*)
    define b as rec_value = prev(rec_value)
  )
) n
on ( (o.pk_id, o.date_from) = ((n.pk_id, n.date_from)) )
when matched then update set
  load_id = to_number(:load_id)+.1, date_to = n.date_to
  where o.date_to != n.date_to
when not matched then insert values(
  n.pk_id, n.date_from, n.date_to, n.rec_value, :load_id+.2
);

5 rows merged.
  • The WHERE clause of the UPDATE: my USING clause may return rows from ALL_DATA that don’t need to be changed, so I compare DATE_TO values to make sure I don’t do an update for nothing.
  • As you can see from the output below, I updated the existing row and inserted 4 new rows.
select * from all_data order by 3;
PK_ID DATE_FROM  DATE_TO    REC_VALUE LOAD_ID
----- ---------- ---------- --------- -------
    1 2014-11-13 2014-12-02 A             2.1
    1 2014-12-02 2014-12-04 B             2.2
    1 2014-12-04 2014-12-05 C             2.2
    1 2014-12-05 2014-12-07 A             2.2
    1 2014-12-07 9999-12-31 D             2.2

 

Refreshing CURRENT_DATA

merge into current_data o
using (
  select * from all_data
  where date_to = date '9999-12-31'
) n
on (o.pk_id = n.pk_id)
when matched then update set
  load_id = n.load_id, date_from = n.date_from,
  rec_value = n.rec_value
  where o.load_id != n.load_id
when not matched then insert values(
  n.pk_id, n.date_from, n.date_to, n.rec_value, n.load_id
);

1 rows merged.

Avoiding Unnecessary Updates

If you run each MERGE statement a second time, you will see the lovely message “0 rows merged.” I always do this test: if the second run changes data, either there is a bug or I am updating rows to be the same as they were before, which is a lot of work for no benefit.

Statement-Level Atomicity

[UPDATE 2014/11/17: Horrible copy/paste error! My PL/SQL code below left out the all-important RAISE; command in the exception handler. Many thanks to Dom Brooks for giving me a heads up.]

So important, yet so often ignored or misunderstood! No, not me, but “statement-level atomicity”.

Whenever you call the Oracle database to change data, the result will be all or nothing: Oracle will either do everything you asked it to, or nothing at all. This is true for SQL and PL/SQL.

What the documentation says

From the Concepts guide:

If a SQL statement causes an error during execution, then it is not successful and so all effects of the statement are rolled back. This operation is a statement-level rollback. This operation has the following characteristics:

  • A SQL statement that does not succeed causes the loss only of work it would have performed itself. The unsuccessful statement does not cause the loss of any work that preceded it in the current transaction…
  • The effect of the rollback is as if the statement had never been run.

There is no specific mention of PL/SQL here. I asked the author, Tom Kyte, about PL/SQL “statements” in the asktom.oracle.com thread “unhandled exceptions”. He replied:

any “statement” is an atomic statement.
every “statement” is
plsql is just a statement, so is update, they are all just statements.
all statements in Oracle are atomic…

I kept searching the Oracle documentation and finally found this in the TimesTen PL/SQL Developer’s Guide:

TimesTen PL/SQL differs from Oracle Database PL/SQL in a scenario where an application executes PL/SQL in the middle of a transaction, and an unhandled exception occurs during execution of the PL/SQL. Oracle Database rolls back to the beginning of the anonymous block. TimesTen does not roll back.

So there it is in black and white: it doesn’t matter whether you call the SQL engine or the PL/SQL engine, your “statement” either makes all the changes you asked for or none at all.

What Steven Feuerstein says

This post was inspired by a recent tweet from Steven. It may seem to contradict the Oracle documentation, but in fact it doesn’t: Steven is talking about what happens inside the PL/SQL code, whereas the documentation refers to what happens after the PL/SQL code has finished.

What my tests show

Here is a simple test case that shows how Steven is right.

CREATE TABLE T (N NUMBER);

set serveroutput on

declare
  procedure show_cnt(p_label in varchar2) is
    l_cnt number;
  begin
    select count(*) into l_cnt from t;
    dbms_output.put_line(p_label || ', count(*) = ' || l_cnt);
  end show_cnt;
begin
  show_cnt('At beginning');
  insert into t values(1);
  show_cnt('After good insert');
  insert into t values('a');
exception when invalid_number then
  show_cnt('After bad insert');
  -- the following line reraises the exception
  RAISE;
end;
/
Error report -
ORA-01722: invalid number
ORA-06512: at line 16
01722. 00000 -  "invalid number"
*Cause:    The specified number was invalid.
*Action:   Specify a valid number.
At beginning, count(*) = 0
After good insert, count(*) = 1
After bad insert, count(*) = 1

select count(*) from t;

  COUNT(*)
----------
         0 

select count(*) trans_cnt
from v$transaction;

 TRANS_CNT
----------
         0 

Notice how the first insert did NOT get rolled back within the PL/SQL block, but it DID get rolled back after the block ended!
Notice also that, since the PL/SQL block was the first statement in a transaction, there is no transaction anymore. The situation is exactly as if the statement never ran in the first place.

Conclusion

  • When a “client” calls the Oracle database, it doesn’t matter whether it calls the SQL engine or the PL/SQL engine, the entire call is a “statement” that will always be atomic.
  • When PL/SQL (inside the database) calls the SQL engine, the SQL statement is atomic. If the execution goes wrong, only that SQL statement is rolled back.
  • PL/SQL execution “goes wrong” if and only if an exception is passed back to the calling program. Without the RAISE in my exception handler, the exception would never go back to the caller and Oracle would not roll back the statement!
  • If the PL/SQL code does a COMMIT or ROLLBACK itself, statement-level atomicity will not work as intended (see the sketch below).
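
To illustrate that last point, here is a hedged variant of my test block. This is my own sketch, not part of the original tests: a COMMIT inside the block makes the first insert permanent, so the unhandled exception that follows can no longer undo it.

truncate table t;

begin
  insert into t values(1);
  commit;                     -- the first insert is now permanent
  insert into t values('a');  -- raises ORA-01722, unhandled
end;
/

select count(*) from t;
-- returns 1: the committed insert survives the rollback of the failed block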

OpenWorld Presentation on Row Pattern Matching

You can download my presentation from

http://www.slideshare.net/stewashton/row-patternmatching12coow14

Please download instead of viewing: the SlideShare viewer doesn’t show the animations I worked so hard on.

I have already blogged about some of the topics in my presentation. Sometime soon I will continue with in-depth looks at other issues, such as “catastrophic backtracking”.

For the many who missed it, I will be giving the same presentation at UKOUG Tech14 in December.

Oracle 12c Row Pattern Matching: Beat the Best Pre-12c Solutions at OpenWorld!

Catchy title, don’t you think? My session has been moved to Monday at 4 P.M., in direct conflict with Tom Kyte and with Keith Laker, who asked me to present in the first place.

Avoid the lines: come see the MATCH_RECOGNIZE clause push great pre-12c solutions into retirement. As a bonus, be the first person on your block able to prevent “catastrophic backtracking”.

Click here to see my session in the OpenWorld catalog

Rob van Wijk and ad hoc Grouping

In March 2014, I wrote about two methods for ad hoc grouping: “Grouping Sequences” and “Start of Group”. I just found out that Rob van Wijk wrote on the same subjects two months earlier. We even used the same example!

I mention Rob’s article for two reasons:

  • If my explanation wasn’t clear enough, you can try Rob’s.
  • When I use “Start of Group”, Rob uses a similar method he calls “max-on-case-row-number” (sketched below). These techniques are completely equivalent: execution plans and performance are identical.
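
Here is a minimal sketch of that variant as I understand it (my own reconstruction, not Rob’s code), applied to a hypothetical table T(grp_key, sort_key, val): ROW_NUMBER numbers the rows, a CASE keeps that number only on “start of group” rows, and a running MAX carries it forward as the group identifier.

select grp_key, val,
       min(sort_key) grp_start,
       max(sort_key) grp_end
from (
  select t.*,
         -- carry the row number of the group's first row down to every row of the group
         max(case when val = prev_val then null else rn end)
             over(partition by grp_key order by sort_key
                  rows between unbounded preceding and current row) grp
  from (
    select t.*,
           row_number() over(partition by grp_key order by sort_key) rn,
           lag(val)     over(partition by grp_key order by sort_key) prev_val
    from t
  ) t
)
group by grp_key, grp, val
order by grp_key, grp_start;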

The “Start of Group” method can take many forms. Don’t think you need to know every one! Just choose the form you understand and like best, and use it all the time.

Don’t forget, when you get 12c you can replace both methods with the MATCH_RECOGNIZE clause :)

SQL and date ranges: don’t make NULL mean something

We often use NULL values in date ranges, usually in the end date to signify the indefinite future. I have decided that this is a bad idea; here are my reasons.

[Update 2014-07-04] There have been some great replies to this post. I am going to address some of their points.

1) NULL should not mean anything

According to the SQL standard, NULL means “unknown”. That is why you can’t really compare a NULL value to anything, you can only say if it IS NULL or not.

If you use NULL in date ranges, it doesn’t just mean something, it means two different things! In the start date it means “from the beginning of time”, and in the end date it means “until the end of time”. This is not only contrary to the “official” meaning of NULL, it is confusing.

[Update] Kevan Gelling points out that there may really be “unknown” values, in which case NULL should be allowed. I agree, as long as NULL is reserved for that use.

Jeffrey Kemp says: ‘In most cases an End Date of NULL means “we don’t know (yet) what the end date will be – or even if it will ever end”.’ Jeffrey, you should add “we don’t even know if it has already ended”! Your phrase taken alone implies that the end date is in the future, not in the past. This makes NULL meaningful.

2) We can use real date limits instead of NULL

One argument for using NULL is that it means the earliest possible start date, or the latest possible end date. We don’t need NULL for that! In Oracle, the earliest possible date is 4712-01-01 B.C., or DATE '-4712-01-01'. The latest possible date is DATE '9999-12-31' (you can add the time element 23:59:59 if you want to be a purist.)

To enforce this, I suggest declaring the start and end date columns as NOT NULL with default values. In Database 12c, I would use the DEFAULT ON NULL clause: this clause puts the default value in the column even if you explicitly try to put a NULL there.

If you want the output to show NULL instead of these default values, you do have to use something like NULLIF() in your SELECT clause.
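
Here is a minimal sketch of what I mean, with made-up table and column names (my example, not a recommendation):

create table price_history (
  obj_id    number not null,
  price     number not null,
  -- DEFAULT ON NULL also makes the column NOT NULL
  date_from date default on null date '-4712-01-01',
  date_to   date default on null date '9999-12-31'
);

-- An explicit NULL is replaced by the default value:
insert into price_history(obj_id, price, date_from, date_to)
values (1, 10, date '2014-07-01', null);

-- Show NULL instead of "the end of time" in the output:
select obj_id, price, date_from,
       nullif(date_to, date '9999-12-31') date_to
from price_history;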

[Update] Again, Kevan Gelling argues that NULL may be necessary to indicate a value that is not known. In that case my suggestion is no good, but you still have to use something other than NULL to indicate “beginning of time” and “end of time”.

Several posters bring up the fact that artificial extreme dates will “skew” the data and make it hard for the optimizer to choose the best plan. They are right, I should mention this. However, the data is skewed whether I use extreme dates or NULL. Supposing I do use NULL to mean “the end of time”, many queries will have to use END_DATE IS NULL in the WHERE clause, so the skew is there no matter what.

3) Oracle doesn’t index NULLs

When you query data with date ranges, you often have to check “greater than” one value and “less than” another value. This may require separate indexes on start date and end date. If you have NULLs in your date ranges, those indexes will not be used since Oracle doesn’t create index entries when all the indexed columns are NULL.

If you use real values and NOT NULL, your indexes will always work.

[Update] Some readers apparently didn’t see the qualifier “when all the indexed columns are NULL” in the previous paragraph. The header was probably misleading: Oracle will indeed index a row if any of the indexed columns is not NULL.

Some objected that indexes on dates alone are rarely used: date ranges are almost always applied to some object, so the index will include the object and one or both dates. In that case, the rows with NULL dates will be indexed. I agree. In that situation, NULLs don’t cause a problem, and “skew” won’t either, as long as the object comes before the date in the index column list.

4) Queries on NULLable date ranges are hard

I have blogged about various date range problems: finding gaps, finding overlaps and merging contiguous ranges. Almost always, my solutions worked fine without NULLs and broke when I introduced NULLs. Handling NULLs required either more complex queries or substituting real values for NULL. Why go to all that work when we can just use the real values to begin with?
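
For example, here is a hedged illustration (my own, reusing the PRICE_HISTORY sketch above and adding a hypothetical PRICE_HISTORY_N variant whose dates can be NULL) of the question “which row was in effect on a given date?”:

-- Hypothetical variant where NULL stands for "beginning" and "end" of time
create table price_history_n (
  obj_id    number not null,
  price     number not null,
  date_from date,
  date_to   date
);

-- With real limits and NOT NULL columns, the predicate is simple:
select * from price_history
where obj_id = :id
and   date_from <= :d
and   date_to   >  :d;

-- With NULLable limits, every such query needs extra handling:
select * from price_history_n
where obj_id = :id
and   (date_from <= :d or date_from is null)
and   (date_to   >  :d or date_to   is null);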

[Update] Even if we need NULL values sometimes to mean “unknown value”, we still don’t use IS NULL in our queries, so indexes should work when needed.

5) Unique constraints don’t work with NULL

Most queries on date ranges, including mine, assume that start dates are unique. In some cases, it may also be useful to have unique end dates. The only way to make sure they are unique is to add a unique constraint. Unfortunately, uniqueness is only enforced for NOT NULL values.

Most of the time, your constraint will be on two columns: some object and a date. In this case the unique constraint will work as long as the object column is NOT NULL. However, anytime you need unique dates throughout a table you must define them as NOT NULL and use default values.
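
As a hedged sketch on the tables from the previous illustration: the composite constraint is enforced for every row because OBJ_ID is never NULL, whereas unique dates across the whole table are only reliable with NOT NULL columns and default values.

-- Unique start date per object: works even though DATE_FROM can be NULL
alter table price_history_n
  add constraint price_history_n_uk unique (obj_id, date_from);

-- Unique start date across the whole table: use the NOT NULL version with defaults
alter table price_history
  add constraint price_history_uk unique (date_from);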

[Update] As mentioned under paragraph 3), this point is only valid for unique constraints on the date alone. This is probably a rare case. Also, as Kevan reminded me, you could always use a function-based index on (DATE_COL, ‘X’) to make sure every row was indexed.
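
Here is a sketch of that suggestion as I understand it, again on the hypothetical PRICE_HISTORY_N table: appending a constant expression means every row appears in the index, even when the date is NULL.

create index price_history_n_ix on price_history_n (date_to, 'X');

-- Rows with a NULL DATE_TO are now in the index, so this query can use it:
select * from price_history_n where date_to is null;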

[Update] Conclusion: NULL should mean “unknown” only

As readers have pointed out, my arguments 3) and 5) are weak, since they only apply to indexes or constraints on a date column alone, and those are rarely needed.

I’ll stand by arguments 1), 2) and 4), but thanks to Kevan I’ll add one thing: just because I don’t use NULL to mean “beginning or end of time”, that doesn’t mean I might not need it to mean “unknown value”. In that case, default values may not be the way to go.

Finally, please read the replies: good stuff, and thanks to all!