MATCH_RECOGNIZE Restrictions

Sometimes we think that certain restrictions are not documented when in fact they are. Where do we forget to look? Database error messages.

As mentioned recently on asktom.oracle.com (thanks, Connor), all error messages and event codes are listed in $ORACLE_HOME/rdbms/mesg/oraus.msg. Searching for “pattern matching”, we read “62500 – 62549 are reserved for PATTERN MATCHING”.

So here is a link to all the MATCH_RECOGNIZE error messages: ORA-60001 to ORA-65535

As of version 19c, the range actually used is 62500 through 62521.

It was a surprise to discover that “bounded quantifiers” are limited to 200, but there it is in black and white:

“ORA-62518: MATCH_RECOGNIZE pattern is too large.”
“Permute with more than 7 elements or bounded quantifier bound greater than 200 are … currently not supported.”
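
As a quick check (just a sketch, assuming any table T with a numeric column N), a bound greater than 200 should trip this limit at parse time:

select *
from t
match_recognize(
  order by n
  measures count(*) as cnt
  one row per match
  pattern( a{1,201} )         -- bound > 200: expect ORA-62518
  define a as n is not null
);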

I’m not suggesting we spend our lives reading database error messages, but we have to admit they are documented ;-)

Best regards,
Stew

P.S. Where are the values of UB2MAXVAL and UB4MAXVAL documented? If you know, please post a link in the comments below. I suspect UB2MAXVAL is 0XFFFF and UB4MAXVAL is 0XFFFFFFFF.

FOOTNOTE 2019-11-05 : thanks to Maxim Demenko for pointing out where UB*MAXVAL are documented. In the OCI Programmer’s Guide, at the very end of the chapter on Data Types, it says

“Throughout this guide there are references to data types like ub2 or sb4, or to constants like UB4MAXVAL. These types are defined in the oratypes.h header file, which is found in the public directory. The exact contents may vary according to the operating system that you are using.”

Backtracking

By popular request, here are my thoughts about the impact of “backtracking” on performance when using the MATCH_RECOGNIZE clause. This came up again because of a query that Jonathan Lewis wrote recently; however, I will concentrate on the theory, not the query.

Patterns

The match_recognize clause implements row pattern matching: it recognizes that a consecutive series of rows is a match for a defined pattern.

The pattern is described in the aptly named PATTERN clause. The syntax resembles a subset of the regular expression syntax. For example:

  • A regular expression pattern ‘AB’ means “find the character A immediately followed by the character B”.
  • In MATCH_RECOGNIZE, PATTERN(A B) means “find a row that meets condition A, immediately followed by a row that meets condition B”. The conditions are then described in the DEFINE clause.

Both syntaxes use two features that can lead to backtracking: quantifiers and alternation.

Quantifiers

PATTERN(A) means find exactly one row that meets condition A.

We can be more flexible in how many A rows we want:

  • A? means we want 0 or 1 A row;
  • A* means we want 0 or more A rows;
  • A+ means we want 1 or more A rows;
  • A{100} means we want exactly 100 A rows;
  • A{3,100} means we want from 3 to 100 A rows.

Notice the word “exactly” appears only once in this list. All the other quantifiers are what I’ll call indefinite: there can be more than one series of rows that match the pattern! Suppose we have 200 consecutive rows that meet condition A: the pattern A* could be met 201 different ways.

When a quantifier is indefinite, the rule is to match as many rows as possible: the quantifiers are greedy. If we add a question mark, the rule is to match as few rows as possible: the quantifier becomes reluctant.

  • A{3,100} will match 100 rows if it can.
  • A{3,100}? will match 3 rows and then stop.

Whether greedy or reluctant, indefinite quantifiers can lead to backtracking. More in a minute.
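
Here is a minimal sketch of the difference, assuming a hypothetical table TICKS(id, price) in which every row meets condition A (defined here as price > 0):

-- Greedy: each match takes as many A rows as it can, so matches are 100 rows long.
select *
from ticks
match_recognize(
  order by id
  measures match_number() as match_no, count(*) as rows_in_match
  one row per match
  pattern( a{3,100} )
  define a as price > 0
);

-- Reluctant: each match stops as soon as it has 3 A rows.
select *
from ticks
match_recognize(
  order by id
  measures match_number() as match_no, count(*) as rows_in_match
  one row per match
  pattern( a{3,100}? )
  define a as price > 0
);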

Alternation

To quote the documentation, an “alternation list is created by placing a vertical bar (|) between each regular expression. Alternatives are preferred in the order they are specified. As an example, PATTERN (A | B | C) attempts to match A first. If A is not matched, it attempts to match B. If B is not matched, it attempts to match C.”

Alternation is also indefinite: in this example, the number of rows is always the same, but the same row might meet any of three different conditions.

From now on I’ll concentrate on quantifiers, since they are much more common.

From “indefinite” to backtracking

Even though an indefinite quantifier means there is more than one correct answer, there is always one preferred answer, so what’s the big deal? Suppose there are 200 rows that meet the A condition:

  • A{1,100} returns 100 rows and
  • A{1,100}? returns 1 row.

Aha! But what if there is another condition after A?

With PATTERN(A{1,100} B), suppose there are 101 consecutive rows that meet A but not B.
A regular expression engine should find 100 A, then not find a B.
It will then backtrack, “giving back” A 100. It will then find there is no B.
It will then backtrack, “giving back” A 99. It will then find there is no B.
And so on all the way back to 1.

With PATTERN(A{1,100}? B), suppose there are 100 consecutive As followed by a B.
The engine should find 1 A, then not find a B.
It will then backtrack, adding A 2. It will then not find a B.
And so on all the way up to 100.

So backtracking does not mean “giving back” an A; it means backing up from B to A.

To summarize: backtracking can happen when an indefinite quantifier is followed by another condition. With greedy quantifiers, the worst backtracking happens when there is no match, because every possible solution must be tested. With reluctant quantifiers, backtracking may happen even when there is eventually a match.

Instrumentation?

There is one bit of instrumentation about backtracking in the explain plan. Here is a quote from Oracle development that Keith Laker sent me five years ago:

In the explain plan the specific pattern matching keywords are: MATCH RECOGNIZE (SORT | BUFFER) [DETERMINISTIC FINITE AUTO]

When the plan shows “MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO”, here “MATCH RECOGNIZE” refers to the row source for evaluating the match_recognize clause, “SORT” means the row source uses “SORT” to sort the data before running it through the state machine to find matches, and “DETERMINISTIC FINITE AUTO” means the state machine that we constructed is deterministic and thus when running the sorted rows through the state machine, we don’t do backtracking. “DETERMINISTIC FINITE AUTO” is desirable as the execution is efficient when there is no backtracking.

Currently we detect deterministic finite automaton by checking the state machine built: if any state has 2 or more outgoing transitions then we regard the state machine as non-deterministic, if any final state is followed by a non-final state, then the state machine is regarded as non-deterministic. We don’t check the predicates associated with each transition at all. At the moment we can only detect a few trivial cases such as PATTERN (A B C), PATTERN (A B+), PATTERN (A B*), etc.

For PATTERN (A | B) , or PATTERN (A B+ C) we just regard the state machine as non-deterministic. We don’t check the mutual exclusiveness of the define predicates in detecting a deterministic state machine.
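
To see which situation we are in, we can simply check the plan. Here is a sketch, assuming a table T with a numeric column N; PATTERN (A B*) is one of the trivial cases listed above, so we would expect the DETERMINISTIC FINITE AUTO keywords:

explain plan for
select *
from t
match_recognize(
  order by n
  measures count(*) as cnt
  one row per match
  pattern( a b* )
  define b as n >= prev(n)
);

-- Look for "MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO" in the Operation column
select * from table(dbms_xplan.display);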

Conclusions

The quote from Oracle development confirms that alternation, and indefinite quantifiers followed by another condition, may be subject to backtracking. If we are lucky enough to see DETERMINISTIC FINITE AUTO, we know backtracking is not a problem. In testing, we should always test situations where no match is found. If there are reluctant quantifiers, we should also test situations where there is a match, but not right away.

Finally, each condition should be defined as strictly as possible, saying what it should be and also what it should not be. More than once, I have run into backtracking problems because the first condition was always true; once I defined the first condition more strictly, potential matches were eliminated much earlier and the query sped up considerably.
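
As a sketch of that advice, assume a hypothetical EVENTS(id, flag, amount) table where we want an 'S' row, then up to 100 positive-amount rows, then an 'E' row. If A were left undefined it would always be true, so every row would start a candidate match that could consume up to 100 B rows before failing to find C and backtracking. Defining A (and B) strictly rejects most starting rows immediately:

select *
from events
match_recognize(
  order by id
  measures a.id as start_id, count(*) as rows_in_match
  one row per match
  after match skip past last row
  pattern( a b{1,100} c )
  define
    a as flag = 'S',                            -- say what A must be...
    b as flag not in ('S','E') and amount > 0,  -- ...and what B must not be
    c as flag = 'E'
);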

Hope this helps,
Stew

Predict_Clustering_Factor

Jonathan Lewis recently wrote about estimating the clustering factor of an index, taking into account the intended value of the TABLE_CACHED_BLOCKS parameter of DBMS_STATS.SET_TABLE_PREFS. He included a function I wrote called predict_clustering_factor. Here is a corrected and improved version. Any questions or remarks about the code are welcome here.

Please refer to blog posts by Jonathan and Richard Foote for explanations of the clustering factor and the TABLE_CACHED_BLOCKS parameter.

UPDATE 2019-10-28: thanks to a comment by Viacheslav Andzhich on Jonathan’s post, I found that PLSQL_CODE_TYPE=’NATIVE’ makes the function run in two-thirds the time. He also mentioned the possibility of overflow on PLS_INTEGER, so I now use INTEGER only. Thanks Viacheslav!

Description

The function takes as input:
- p_table_name: the table name
- p_column_list: a comma-separated list of the column names to be indexed
- p_max_table_cached_blocks: the MAXIMUM value of TABLE_CACHED_BLOCKS to be tested (255 by default).

The function then queries the ROWIDs, sorted in index order:
– Rows are omitted if all index columns are NULL.
– In case of ties, rows are sorted by ROWID.
– Each ROWID has its “row number” portion set to 0, so it effectively becomes a block id.

The function collects the most recent p_max_table_cached_blocks block ids.
When there is no more room, it replaces the Least Recently Used block id with the current one.

lt_hits_per_RU is a table of intermediate results. The number of entries =  p_max_table_cached_blocks. When a new row has a block id among the most recent, I call that a “hit”. Every hit adds one to an entry in the result table: the index of the entry is based on how “recently used” the block id was. For example, if the current row and the previous row are in the same block, the index is one.

After all the rows are processed, the result table is transformed into a table of clustering factors: the clustering factor is the total number of rows minus all the hits up to and including the current entry.

In the output, the number of rows is equal to p_max_table_cached_blocks.
The ROWNUM provides the TABLE_CACHED_BLOCKS parameter value.

Sample of usage:

select rownum table_cached_blocks,
column_value clustering_factor
from table(predict_clustering_factor('T1','v1,rand'))

The code

ALTER SESSION SET PLSQL_CODE_TYPE='NATIVE';

create or replace function predict_clustering_factor(
/*
Function to predict the clustering factor of an index,
taking into account the intended value of
the TABLE_CACHED_BLOCKS parameter of DBMS_STATS.SET_TABLE_PREFS.

The function takes as input:
- p_table_name: the table name
- p_column_list: a comma-separated list of the column names to be indexed
- p_max_table_cached_blocks: the MAXIMUM value of TABLE_CACHED_BLOCKS to be tested
  (255 by default).

The function then queries the ROWIDs, sorted in index order:
- Rows are omitted if all index columns are NULL.
- In case of ties, rows are sorted by ROWID.
- Each ROWID has its "row number" portion set to 0, so it effectively becomes a block id.

The function collects the most recent p_max_table_cached_blocks block ids.
When there is no more room, it replaces the Least Recently Used block id with the current one.

The function returns a table of clustering factors.
The number of rows is equal to p_max_table_cached_blocks.
The ROWNUM of the table is the TABLE_CACHED_BLOCKS parameter value.

Sample of usage:
  select rownum table_cached_blocks,
    column_value clustering_factor
  from table(predict_clustering_factor('T1','v1,rand'))
*/
  p_table_name in varchar2,
  p_column_list in varchar2,
  p_max_table_cached_blocks in number default 255
) return sys.odcinumberlist authid current_user is

  sql_text varchar2(4000);
  rc sys_refcursor;
  type tt_rids is table of rowid;
  lt_rids tt_rids;

  type t_block is record(
    block_id rowid,
    most_recent_hit integer
  );
  type tt_blocks is table of t_block;
  lt_blocks tt_blocks := new tt_blocks();
  l_block_id rowid;
  l_block_id_prev rowid;

  l_rn integer := 0;
  b_block_found boolean;
  l_LRU integer;
  i_LRU integer := 0;

  lt_hits_per_RU sys.odcinumberlist := new sys.odcinumberlist();
  i_hits_per_RU integer;

  function truncated_rid(p_rid in rowid) return rowid is
    l_rowid_type number;
    l_object_number NUMBER;
    l_relative_fno NUMBER;
    l_block_number NUMBER;
    l_row_number NUMBER;
  begin
    DBMS_ROWID.ROWID_INFO (
      p_rid,
      l_rowid_type,
      l_object_number,
      l_relative_fno,
      l_block_number,
      l_row_number
    );
    return DBMS_ROWID.ROWID_CREATE (
      l_rowid_type,
      l_object_number,
      l_relative_fno,
      l_block_number,
      0
    );
  end truncated_rid;

  function hits_per_RU(p_most_recent_hit in integer) return integer is
    i_hits_per_RU integer := 1;
  begin
    for i in 1..lt_blocks.count loop
      if lt_blocks(i).most_recent_hit > p_most_recent_hit then
        i_hits_per_RU := i_hits_per_RU + 1;
      end if;
    end loop;
    return i_hits_per_RU;
  end hits_per_RU;

begin
  -- Check for valid p_max_table_cached_blocks value
  if p_max_table_cached_blocks != trunc(p_max_table_cached_blocks)
  or p_max_table_cached_blocks not between 1 and 255 then
    raise_application_error(
      -20001,
      'input parameter p_max_table_cached_blocks must be an integer between 1 and 255'
    );
  end if;

  -- Initialize hits_per_RU table
  lt_hits_per_RU.extend(p_max_table_cached_blocks);
  for i in 1..lt_hits_per_RU.count loop
    lt_hits_per_RU(i) := 0;
  end loop;

  -- Execute query that mimics index
  sql_text :=
    'select rowid from '||p_table_name
    ||' where '||replace(p_column_list, ',', ' is not null or ')||' is not null'
    ||' order by '||p_column_list||', rowid';
  dbms_output.put_line('Query text: '||sql_text);
  open rc for sql_text;

  loop
    fetch rc bulk collect into lt_rids limit 10000;

    for irid in 1..lt_rids.count loop

      l_rn := l_rn + 1;
      l_block_id := truncated_rid(lt_rids(irid));

      -- Optimized treatment of first row
      if l_rn = 1 then
        lt_blocks.extend;
        lt_blocks(1).block_id := l_block_id;
        lt_blocks(1).most_recent_hit := l_rn;
        l_block_id_prev := l_block_id;
        continue;
      end if;

      -- Optimized treatment of consecutive rows in same block
      if l_block_id = l_block_id_prev then
        lt_hits_per_RU(1) := lt_hits_per_RU(1) + 1;
        continue;
      end if;

      l_block_id_prev := l_block_id;
      l_LRU := l_rn;
      b_block_found := false;

      for i in 1..lt_blocks.count loop

        -- if the new block_id is never found,
        -- i_LRU will point to the Least Recently Used block
        if l_LRU > lt_blocks(i).most_recent_hit then
          l_LRU := lt_blocks(i).most_recent_hit;
          i_LRU := i;
        end if;

        -- if the new block_id is found,
        -- then how many blocks ago was it found?
        if lt_blocks(i).block_id = l_block_id then
          b_block_found := true;
          -- how recently used is the block?
          i_hits_per_RU := hits_per_RU(lt_blocks(i).most_recent_hit);
          -- update hit summary
          lt_hits_per_RU(i_hits_per_RU) := lt_hits_per_RU(i_hits_per_RU) + 1;
          -- the block_id was just hit, so update most_recent_hit value
          lt_blocks(i).most_recent_hit := l_rn;
          exit;
        end if;

      end loop;

      -- If new block_id, add to lt_blocks if there is room,
      -- otherwise overwrite Least Recently Used entry
      if not b_block_found then
        if lt_blocks.count < p_max_table_cached_blocks then
          lt_blocks.extend;
          lt_blocks(lt_blocks.count).block_id := l_block_id;
          lt_blocks(lt_blocks.count).most_recent_hit := l_rn;
        else
          lt_blocks(i_LRU).block_id := l_block_id;
          lt_blocks(i_LRU).most_recent_hit := l_rn;
        end if;
      end if;

    end loop;

    exit when rc%notfound;
  end loop;

  close rc;

-- Prepare output
  -- calculate cumulative sum of hits
  for i in 2..lt_hits_per_RU.count loop
    lt_hits_per_RU(i) := lt_hits_per_RU(i) + lt_hits_per_RU(i-1);
  end loop;
  -- subtract cumulative hits from total number of rows to get
  -- clustering factor. ROWNUM provides the TABLE_CACHED_BLOCKS value.
  for i in 1..lt_hits_per_RU.count loop
    lt_hits_per_RU(i) := l_rn - lt_hits_per_RU(i);
  end loop;

  dbms_output.put_line('Total number of rows in index = '|| l_rn);
  return lt_hits_per_RU;

exception when others then
  if rc%isopen then
    close rc;
  end if;
  raise;

end predict_clustering_factor;
/

LISTAGG() as a Cumulative Function

LISTAGG() can be used as an analytic function, but cannot be cumulative. What does that mean, and is there a workaround?

I’ll demonstrate with the boring old DEPT and EMP tables, just because they have so few rows, and I’ll keep only the department and employee names:

create or replace view denames as
select dname, ename from dept join emp using(deptno);

select * from denames
order by dname, ename;
DNAME ENAME
ACCOUNTING CLARK
ACCOUNTING KING
ACCOUNTING MILLER
RESEARCH ADAMS
RESEARCH FORD
RESEARCH JONES
RESEARCH SCOTT
RESEARCH SMITH
SALES ALLEN
SALES BLAKE
SALES JAMES
SALES MARTIN
SALES TURNER
SALES WARD


Reminder: aggregate versus analytic

“Aggregate functions return a single result row based on groups of rows, rather than on single rows.” (19c doc)

select dname,
  listagg(ename, ',') within group(order by ename) enames
from denames
group by dname
order by dname;
DNAME ENAMES
ACCOUNTING CLARK,KING,MILLER
RESEARCH ADAMS,FORD,JONES,SCOTT,SMITH
SALES ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD


By contrast, “analytic functions…differ from aggregate functions in that they return multiple rows for each group…All joins and all WHERE, GROUP BY, and HAVING clauses are completed before the analytic functions are processed.” (19c). In other words, the output rows are already determined and any analytic function simply adds a column to each existing row. The “analytic clause”, starting with the keyword OVER(), says what rows provide data for the function.

select de.*,
  listagg(ename, ',') within group(order by ename)
    over(partition by dname)
  enames
from denames de;
DNAME ENAME ENAMES
ACCOUNTING CLARK CLARK,KING,MILLER
ACCOUNTING KING CLARK,KING,MILLER
ACCOUNTING MILLER CLARK,KING,MILLER
RESEARCH ADAMS ADAMS,FORD,JONES,SCOTT,SMITH
RESEARCH FORD ADAMS,FORD,JONES,SCOTT,SMITH
RESEARCH JONES ADAMS,FORD,JONES,SCOTT,SMITH
RESEARCH SCOTT ADAMS,FORD,JONES,SCOTT,SMITH
RESEARCH SMITH ADAMS,FORD,JONES,SCOTT,SMITH
SALES ALLEN ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD
SALES BLAKE ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD
SALES JAMES ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD
SALES MARTIN ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD
SALES TURNER ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD
SALES WARD ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD


LISTAGG() Limitations

The “analytic clause” has three parts:

  1. PARTITION BY: like GROUP BY, separates the data into chunks based on one or more values;
  2. ORDER BY: within the “partition”, orders the rows and sets up the next part;
  3. Windowing clause: says which rows within the ordered partition the function applies to.

Some analytic functions do not allow all three parts; the analytic version of LISTAGG() accepts only PARTITION BY. Let’s try to use it as a cumulative function anyway:

select de.*,
  listagg(ename, ',') within group(order by ename)
    over(partition by dname order by ename)
  enames
from denames de;

SQL Error: ORA-30487: ORDER BY not allowed here
30487. 00000 -  "ORDER BY not allowed here"
*Cause:    DISTINCT functions and RATIO_TO_REPORT cannot have an ORDER BY

The error message is a bit strange, but the conclusion is: it doesn’t work.

The workaround

The idea is simple: take what LISTAGG() gives us and do a SUBSTR() based on the row number:

select dname, ename,
  substr(enames, 1, instr(enames, ',', 1, rn) - 1) enames
from (
  select de.*,
    listagg(ename, ',') within group(order by ename)
      over(partition by dname)
    || ','
    enames,
    row_number() over(partition by dname order by ename) rn
  from denames de
);
DNAME ENAME ENAMES
ACCOUNTING CLARK CLARK
ACCOUNTING KING CLARK,KING
ACCOUNTING MILLER CLARK,KING,MILLER
RESEARCH ADAMS ADAMS
RESEARCH FORD ADAMS,FORD
RESEARCH JONES ADAMS,FORD,JONES
RESEARCH SCOTT ADAMS,FORD,JONES,SCOTT
RESEARCH SMITH ADAMS,FORD,JONES,SCOTT,SMITH
SALES ALLEN ALLEN
SALES BLAKE ALLEN,BLAKE
SALES JAMES ALLEN,BLAKE,JAMES
SALES MARTIN ALLEN,BLAKE,JAMES,MARTIN
SALES TURNER ALLEN,BLAKE,JAMES,MARTIN,TURNER
SALES WARD ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD


Caveats

This solution seems fine, as long as the source values are NOT NULL. If there is a NULL value, the “cumulative value” becomes NULL. Look at the last row of the output.

update emp set ename = null where ename = 'ALLEN';

(repeat the query)
DNAME ENAME ENAMES
ACCOUNTING CLARK CLARK
ACCOUNTING KING CLARK,KING
ACCOUNTING MILLER CLARK,KING,MILLER
RESEARCH ADAMS ADAMS
RESEARCH FORD ADAMS,FORD
RESEARCH JONES ADAMS,FORD,JONES
RESEARCH SCOTT ADAMS,FORD,JONES,SCOTT
RESEARCH SMITH ADAMS,FORD,JONES,SCOTT,SMITH
SALES BLAKE BLAKE
SALES JAMES BLAKE,JAMES
SALES MARTIN BLAKE,JAMES,MARTIN
SALES TURNER BLAKE,JAMES,MARTIN,TURNER
SALES WARD BLAKE,JAMES,MARTIN,TURNER,WARD
SALES


Also, with version 19c LISTAGG() allows the DISTINCT keyword. If you want to use DISTINCT, then instead of ROW_NUMBER() use DENSE_RANK()!

update emp set ename = 'MARTIN' where ename is null;

select dname, ename,
  substr(enames, 1, instr(enames, ',', 1, rn) - 1) enames
from (
  select de.*,
    listagg(DISTINCT ename, ',') within group(order by ename)
      over(partition by dname)
    || ','
    enames,
    DENSE_RANK() over(partition by dname order by ename) rn
  from denames de
);
DNAME ENAME ENAMES
ACCOUNTING CLARK CLARK
ACCOUNTING KING CLARK,KING
ACCOUNTING MILLER CLARK,KING,MILLER
RESEARCH ADAMS ADAMS
RESEARCH FORD ADAMS,FORD
RESEARCH JONES ADAMS,FORD,JONES
RESEARCH SCOTT ADAMS,FORD,JONES,SCOTT
RESEARCH SMITH ADAMS,FORD,JONES,SCOTT,SMITH
SALES BLAKE BLAKE
SALES JAMES BLAKE,JAMES
SALES MARTIN BLAKE,JAMES,MARTIN
SALES MARTIN BLAKE,JAMES,MARTIN
SALES TURNER BLAKE,JAMES,MARTIN,TURNER
SALES WARD BLAKE,JAMES,MARTIN,TURNER,WARD

With these precautions, why not use LISTAGG() as a cumulative function if you want? Hope this helps…

Making Longer Lists

For very long lists, we need the return values to be CLOBs. Alas, LISTAGG can only return VARCHAR2 lists, but XML and JSON aggregate functions can return CLOBs!

As an exercise, I’ll start from ALL_OBJECTS: I want to make a list of all fully qualified object names for each object type. To start with, a simple subquery:

with data as (
  select object_type, owner ||'.'|| object_name obj
  from all_objects
)
select * from data;

OBJECT_TYPE  OBJ
INDEX	     SYS.I_FILE#_BLOCK#
INDEX	     SYS.I_OBJ3
INDEX	     SYS.I_TS1
INDEX	     SYS.I_CON1
...

XML

XML functions work on the XMLTYPE data type, which has practically no length limitations. To make my list, I have to go through three steps:

  1. Make an XML fragment out of each object.
    This step will escape any special characters such as '<' or '>'.
  2. Aggregate the XML fragments.
  3. Extract the text from the aggregated XML data.
    This step will unescape the special characters.

I have to handle the comma that separates the elements of the list myself.

with data as (
  select object_type, owner ||'.'|| object_name obj
  from all_objects
)
select object_type,
SUBSTR(
  xmlcast(
    xmlagg(
      xmlelement(O, ',' || obj)
      order by obj
    ) AS CLOB
  ), 2
) qualified_objects
from data
group by object_type
order by object_type;

On my system with about 78000 objects, this takes 3 seconds and a bit.

JSON

JSON support in the Oracle database is pretty new. For full support of CLOBs, we need version 18c or later (or a patched version 12.2, but I don’t know the precise version number).

The steps are a little different with JSON:

  1. I can aggregate the data directly into a JSON array.
    This escapes any special characters like double quotes, newlines, etc.
  2. To turn the multiple array items into one (which is my list), I replace "," (which separates the array items) with my separator.
  3. I then use JSON_VALUE to extract the list from the array.
    This unescapes any character that needs it.

with data as (
  select object_type, owner ||'.'|| object_name obj
  from all_objects
)
select object_type,
  json_value(
    replace(
      json_arrayagg(obj order by obj returning clob),
      '","',
      ','
    ),
    '$[0]' returning clob
  ) qualified_objects
from data
group by object_type
order by object_type;

Instead of 3+ seconds, this code takes a bit over 1 second, so three times faster.

As a reminder, JSON is great for going the other way: split the items of a list into a table. See Splitting Strings: a New Champion!

Interval expressions

I just learned about these critters thanks to an ODC forum question: how to calculate the difference between two dates in calendar years and months, days, hours, minutes and seconds.

What are they, and why are they?

Take any two datetime values (DATE, TIMESTAMP, TIMESTAMP WITH (LOCAL) TIME ZONE) and subtract one from the other: what do you get?

  • With two dates, you get the number of days.
  • With two timestamps, you get an INTERVAL DAY TO SECOND.

If you always want the interval, just put the two values in parentheses and add DAY TO SECOND. You can add YEAR TO MONTH instead to get an INTERVAL YEAR TO MONTH.

All the following combinations are OK:

select (sysdate - sysdate) day to second from dual;
select (sysdate - sysdate) year to month from dual;
select (sysdate - systimestamp) day to second from dual;
select (sysdate - systimestamp) year to month from dual;
select (systimestamp - sysdate) day to second from dual;
select (systimestamp - sysdate) year to month from dual;
select (systimestamp - systimestamp) day to second from dual;
select (systimestamp - systimestamp) year to month from dual;

Interval expressions are a handy way to get intervals (especially year-month intervals) from all datetime datatypes with the same code.

INTERVAL YEAR TO MONTH

INTERVAL YEAR TO MONTH is based on calendar years and months, so the “same” interval may cover a different number of days depending on the context. For example, adding INTERVAL '1' MONTH  to January 1 will add 31 days, but adding it to February 1, 2019 will add 28 days. In fact, sometimes it won’t add any days at all! Tell Oracle to add 1 month to January 31 and it will just say February 31 doesn’t exist. (Actually the error message is not quite that specific…).

The other surprise is rounding: since both year and month portions are integers, the month portion is rounded according to a rule I don’t fully understand:

  • (DATE '2016-02-29' - DATE '2016-02-15') year to month is 0 months, whereas
  • (DATE '2016-03-01' - DATE '2016-02-16') year to month is 1 month.
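
Both behaviours are easy to reproduce from DUAL; the second statement should fail with ORA-01839 (date not valid for month specified):

select (date '2016-02-29' - date '2016-02-15') year to month as ym_0_months,
       (date '2016-03-01' - date '2016-02-16') year to month as ym_1_month
from dual;

select date '2019-01-31' + interval '1' month from dual;  -- ORA-01839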

Keeping all this in mind, we can still use intervals and interval expressions to calculate the difference between two datetime values in calendar years and months, days, hours, minutes and seconds.

The answer, step by step

To split things up into smaller portions that are easier to understand, I’ll start with a view called QUESTION. We can change this view whenever we want different test data. Here I’ll test all possible date-only ranges (all time portions are 00:00:00) in the years 2015 through 2017.

create or replace view question as  
with dates(dte) as (  
  select date '2014-12-31' + level  
  from dual  
  connect by level <= 365+366+365  
)  
, ranges(date_from, date_to) as (  
  select a.dte, b.dte  
  from dates a, dates b  
  where a.dte < b.dte  
)  
select * from ranges;

My first draft of an answer is straightforward:

  • The year-month interval goes from the first day of the “from” month to the first day of the “to” month.
  • The day-second interval =
    • the day-second interval from date_from to date_to
    • minus the day-second interval from the first day of the “from” month to the first day of the “to” month.

create or replace view answer as  
select date_from, date_to,  
  (trunc(date_to,'mm') - trunc(date_from,'mm')) year(5) to month as iym,  
  (date_to - date_from) day(7) to second -  
    (trunc(date_to,'mm') - trunc(date_from,'mm')) day(7) to second  
  as ids  
from question;

This solution lets us do “round trips”: if we have date_from, iym and ids, we can calculate date_to, and if we have date_to, iym and ids we can calculate date_from.

select a.*,  
  case when ids < interval '0' second  
    then date_from + ids + iym  
    else date_from + iym + ids  
  end new_to,  
  case when ids < interval '0' second  
    then date_to - iym - ids  
    else date_to - ids - iym  
  end new_from  
from answer a;

The only problem is that IDS, which is the day-second interval, is negative almost half the time. We can use that fact to calculate properly, but users will find it strange.

In order to satisfy the users, here is a second draft:

  • When IDS is negative
    • subtract 1 from IYM
    • add to IDS the number of days in the “from” month

If we want to avoid ambiguous results, we still have to say when the original IDS was negative.
To simplify “round trips”, I decided to add an IDS_ADJUSTMENT column with the number of days in the “from” month.

create or replace view adjusted_answer as  
with details as (  
  select date_from, date_to,  
    (trunc(date_to,'mm') - trunc(date_from,'mm')) year(5) to month  
    as iym,  
    (date_to - date_from) day(7) to second -  
      (trunc(date_to,'mm') - trunc(date_from,'mm')) day(7) to second  
    as ids,  
    ((trunc(date_from,'mm') + interval '1' month) - trunc(date_from,'mm')) day to second  
    as from_month_days  
  from question  
)  
select date_from, date_to,  
  case when ids < interval '0' second  
    then iym - interval '1' month -- subtract 1 month  
    else iym  
  end iym,  
  case when ids < interval '0' second  
    then ids + from_month_days --add number of days in "from" month  
    else ids  
  end ids,  
  case when ids < interval '0' second  
    then from_month_days  
    else interval '0' second  
  end ids_adjustment    
from details;

This gives plausible results to the users. The same DATE_FROM, IYM and IDS combination will sometimes occur for two different DATE_TO values, but the IDS_ADJUSTMENT column will allow us to do “round trips”.

select a.*,  
  case when ids_adjustment > interval '0' second  
    then date_from  
      + (ids - ids_adjustment)  
      + (iym + interval '1' month)  
    else date_from + iym + ids  
  end new_to,  
  case when ids_adjustment > interval '0' second  
    then date_to  
      - (iym + interval '1' month)  
      - (ids - ids_adjustment)  
    else date_to - ids - iym  
  end new_from      
from adjusted_answer a;

Again, almost half the ranges will have a positive IDS_ADJUSTMENT value. We can eliminate that when the “from” month and the month preceding date_to have the same number of days. (See the BEFORE_TO_MONTH_DAYS measure and the added condition from_month_days != before_to_month_days in the view below.)

create or replace view final_answer as  
with details as (  
  select date_from, date_to,  
    (trunc(date_to,'mm') - trunc(date_from,'mm')) year(5) to month  
    as iym,  
    (date_to - date_from) day(7) to second -  
      (trunc(date_to,'mm') - trunc(date_from,'mm')) day(7) to second  
    as ids,  
    ((trunc(date_from,'mm') + interval '1' month) - trunc(date_from,'mm')) day to second  
    as from_month_days,  
    (trunc(date_to,'mm') - (trunc(date_to,'mm') - interval '1' month)) day to second  
    as before_to_month_days  
  from question  
)  
select date_from, date_to,  
  case when ids < interval '0' second  
    then iym - interval '1' month -- subtract 1 month  
    else iym  
  end iym,  
  case when ids < interval '0' second  
    then ids + from_month_days --add number of days in "from" month  
    else ids  
  end ids,  
  case when ids < interval '0' second and from_month_days != before_to_month_days  
    then from_month_days  
    else interval '0' second  
  end ids_adjustment    
from details;

Now IDS_ADJUSTMENT will only be positive 25% of the time (with this test data!). We can do “round trips” the same way as with the “adjusted answer”.

As a bonus, thanks to “interval expressions” we can change the data types in the QUESTION view and the solution will still work with no changes to the code.
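
For example, here is a variant of the QUESTION view that uses timestamps; it is just a sketch, but the ANSWER views should work on it as they are:

create or replace view question as
with dates(dte) as (
  select cast(date '2014-12-31' + level as timestamp)  -- timestamps instead of dates
  from dual
  connect by level <= 365+366+365
)
, ranges(date_from, date_to) as (
  select a.dte, b.dte
  from dates a, dates b
  where a.dte < b.dte
)
select * from ranges;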

Conclusion

We can calculate a sensible difference (including years and months) between two dates, but unless we provide additional information we cannot calculate one date using the other date and the difference.

Making Lists

Strings that contain multiple values violate first normal form, which mandates “one value per column per row”.  People still ask how to make or massage lists of values. This post is about aggregating values into lists.

Rewrite 2019-07-11: I am now using DNAME instead of DEPTNO so that the JOIN makes sense. Also Tony Hasler pointed out that JOIN before MODEL does not need an inline view.

(To split lists into little tables, see my series on “Splitting strings”.)

Chunks of lists

(For a discussion of this topic that has influenced my solutions, see Managing overflows in LISTAGG by Keith Laker, with solutions provided by Lei Sheng.)

Version 11.2 introduced LISTAGG, which concatenates values with or without a delimiter. For example:

select dname,
  listagg(ename, ',') within group(order by ename) enames
from dept join emp using(deptno)
group by dname
order by dname;
DNAME ENAMES
ACCOUNTING CLARK,KING,MILLER
RESEARCH ADAMS,FORD,JONES,SCOTT,SMITH
SALES ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD

Unfortunately, the return value cannot be a CLOB; if it is too long, we get an ORA-01489 exception. Version 12.2 provides the ON OVERFLOW clause that truncates the return value, so no error but we don’t get all the values back.

Even before version 12.2, we could get all the values if we divided the list into chunks with a valid maximum length.

Starting with 12.1 and using the MATCH_RECOGNIZE clause:

select dname, chunk,
  listagg(ename, ',') within group(order by ename) enames
from (
  select dname, ename from dept join emp using(deptno)
)
match_recognize(
  partition by dname order by ename
  measures match_number() chunk
  all rows per match
  pattern(s b*)
  define b as sum(lengthb(ename)) + count(b.*) <= 20
)
group by dname, chunk
order by dname, chunk;
DNAME CHUNK ENAMES
ACCOUNTING 1 CLARK,KING,MILLER
RESEARCH 1 ADAMS,FORD,JONES
RESEARCH 2 SCOTT,SMITH
SALES 1 ALLEN,BLAKE,JAMES
SALES 2 MARTIN,TURNER,WARD


Notice I am using LENGTHB to calculate the length in bytes of the input strings. COUNT(B.*) adds the lengths of the delimiters, assuming that the delimiter is one byte long. Lei uses LENGTHB even on the delimiter.

Before 12.1 we can get there with the MODEL clause:

select dname, chunk,
  listagg(ename, ',') within group(order by ename) enames
from (
  select * from dept join emp using(deptno)
  model
    partition by (dname)
    dimension by (row_number() over(partition by dname order by ename) rn)
    measures(ename, lengthb(ename) length_so_far, 1 chunk)
    rules(
      length_so_far[rn>1] = lengthb(ename[cv()]) +
        case
          when lengthb(ename[cv()]) + length_so_far[cv()-1] + 1 <= 20
          then length_so_far[cv()-1] + 1
          else 0
        end,
      chunk[rn>1] = chunk[cv()-1] +
        case
          when length_so_far[cv()] = lengthb(ename[cv()])
          then 1
          else 0
        end    
    )
)
group by dname, chunk
order by dname, chunk;

  • The first RULE calculates the running length of the list that would be made, starting over once the list gets past the limit of 20 bytes.
  • The second RULE increases the CHUNK number whenever the running length starts over.
  • Finally, LISTAGG groups by DNAME and CHUNK.

But I want just one long string!

In that case, you need a CLOB. For that, you don’t even need version 11.2, just version 11.1.

select dname,
SUBSTR(
  xmlcast(
    xmlagg(
      xmlelement(E, ',' || ename)
      order by ename
    ) AS CLOB  -- AS VARCHAR2(4000) would be the equivalent of LISTAGG
  ), 2
) enames
from dept join emp using(deptno)
group by dname
order by dname;