SQL table macros 2: “polymorphic” is not “dynamic”

A polymorphic SQL table macro can produce rows with different “shapes”, but the “shape” is established at parse time, not execution time. (By “shape” I mean the number of columns, their names and their characteristics such as data type.)

These macros work very differently from what we are used to. My goal here is to help understand how they work, not blow you away with an advanced use case. The examples are just illustrations, not meaningful solutions.

Starting simple: the TABLE_T parameter

Let’s call a Polymorphic (SQL) Table Macro a PTM for short. Here is a simple one:

SQL> create or replace function noop1(p_table in dbms_tf.table_t)
  2    return varchar2 sql_macro is
  3  begin
  4    dbms_output.put_line('table_schema_name: ' || p_table.table_schema_name);
  5    dbms_output.put_line('table_name: ' || p_table.table_name);
  6    return 'select * from p_table';
  7  end;
  8  /

Function NOOP1 compiled

The input parameter p_table has type DBMS_TF.TABLE_T. It is used in two different ways:

  1. Our PL/SQL code sees a record whose values we can access.
  2. The literal ‘p_table’ appears in the returned string where a table name would normally go, and the actual table name is substituted.

Think how strange that is: the code returns a literal, but part of the literal is really a placeholder for the value of the input parameter! The strangeness continues when we call the macro:

SQL> select * from noop1(dual);

D
-
X

table_schema_name: "SYS"
table_name: "DUAL"

SQL> select * from noop1("DUAL");

D
-
X

table_schema_name: "SYS"
table_name: "DUAL"

SQL> with x as (select * from dual)
  2  select * from noop1(x);

D
-
X

table_schema_name: ""
table_name: ""

  • Notice that the input parameter is not enclosed in single quotes: it is not a literal, but an identifier that is treated just the same as any other table name (or named subquery) that might be found in a SELECT statement.
  • When the input parameter refers to a named subquery, the schema and table name are left empty. The content of the named subquery is plugged into the SELECT statement as an inline view.
  • If we execute any of these SELECT statements twice in a row, the output from DBMS_OUTPUT.PUT_LINE will only appear once. This illustrates that the macro only executes when the statement is parsed.

Another simple example

If the TABLE_T input parameter references a real table or view (not a named subquery), we can plug the values from the input record into the return string.

create or replace function noop2(p_table in dbms_tf.table_t)
  return varchar2 sql_macro is
begin
  dbms_output.put_line('table_schema_name: ' || p_table.table_schema_name);
  dbms_output.put_line('table_name: ' || p_table.table_name);
  return
    case
      when p_table.table_name = '""' then
        'select * from p_table'
      else
        'select * from '|| p_table.table_schema_name || '.' || p_table.table_name
    end;
end;
/

Notice that in one case we write the parameter name and the macro uses it as a placeholder, but in the other case we concatenate input values into the SELECT statement ourselves.

Another new parameter type: COLUMNS_T

PTMs also accept input parameters of type DBMS_TF.COLUMNS_T. Let’s use this to decide which columns we want to see in the output.

create table t(a,b,c,d) as
select 1,2,3,4 from dual;

create or replace function pick_columns(
  p_table in dbms_tf.table_t,
  p_columns in dbms_tf.columns_t
)
  return varchar2 sql_macro is
begin
  return 'select p_columns from p_table';
end;
/
select * from pick_columns(t, columns(a,b,c));

ORA-00904: "P_COLUMNS": invalid identifier

Oops, it looks like automatic substitution doesn’t work. Let’s plug the values in ourselves.

create or replace function pick_columns(
  p_table in dbms_tf.table_t,
  p_columns in dbms_tf.columns_t
)
  return varchar2 sql_macro is
  l_columns varchar2(4000);
  l_sql varchar2(4000);
begin
  for i in 1..p_columns.count loop
    l_columns := l_columns
              || case when l_columns is not null then ',' end 
              || p_columns(i);
  end loop;
  l_sql := 'select ' || l_columns || ' from p_table';
  dbms_output.put_line(l_sql);
  return l_sql;
end;
/
select * from pick_columns(t, columns(a,b,c));

         A          B          C
---------- ---------- ----------
         1          2          3

select "A","B","C" from p_table

Lots of things to notice here:

  • When calling the function, the COLUMNS pseudo-operator must be used. The list contains valid simple identifiers.
  • The input appears to our code as a nested table containing quoted identifiers.
  • We have to plug the values into the SELECT statement ourselves.
  • When we print the statement ourselves, we see our own substitutions but the ‘p_table’ literal has not been replaced yet.

Other parameters

What happens if we use good old scalar parameters? This example takes as input a table, a column and a number, then returns the column value multiplied by the number.

create or replace function multiplier(
  p_table in dbms_tf.table_t,
  p_columns in dbms_tf.columns_t,
  p_multiplier in number
)
  return varchar2 sql_macro is
  l_sql varchar2(4000);
begin
  l_sql := 'select ' || p_columns(1) || ', ' 
            || p_columns(1) 
            || ' * p_multiplier multiplied_value from p_table';
  dbms_output.put_line('p_multiplier = <' || p_multiplier || '>');
  dbms_output.put_line(l_sql);
  return l_sql;
end;
/

SQL> select * from multiplier(t, columns(a), 9);

         A MULTIPLIED_VALUE
---------- ----------------
         1                9

p_multiplier = <9>
select "A", "A" * p_multiplier multiplied_value from p_table

The number 9 is a literal, and as such is part of the statement text; it is also available to the function code. However, in the returned string we should place the parameter name, not the value, just like the table parameter. We can see why if we use a bind variable instead of a literal.

SQL> var n number;
SQL> exec :n := 20;

PL/SQL procedure successfully completed.

SQL> select * from multiplier(t, columns(a), :n);

         A MULTIPLIED_VALUE
---------- ----------------
         1               20

p_multiplier = <>
select "A", "A" * p_multiplier multiplied_value from p_table

Notice that the value of the p_multiplier parameter is NULL. Values of “normal” parameters are not available to the macro when it executes during the parsing phase.

SQL> exec :n := 99;

PL/SQL procedure successfully completed.

SQL> select * from multiplier(t, columns(a), :n);

         A MULTIPLIED_VALUE
---------- ----------------
         1               99

Notice here that the DBMS_OUTPUT.PUT_LINE output doesn’t appear. Thanks to the use of a bind variable, the statement is not reparsed.

Synthesis

SELECT using a SQL table macro is a two-stage rocket:

  1. At hard parse time, the macro executes and returns a string containing a SELECT statement,
    which is then parsed.
  2. At execution time, the SELECT statement is executed. The macro doesn’t execute again, its work is done.

In other words, SQL table macros return code (the SELECT statement), not data, and they do their work at parse time, not execution time.

SQL table macros can have two new types of input parameters:

  • DBMS_TF.TABLE_T : refers to an existing table, view or named subquery
  • DBMS_TF.COLUMNS_T : refers to a list of identifiers that are valid column names, whether they exist or not.

These are in addition to “classic” parameters with SQL or PL/SQL data types.

The values of the TABLE_T and COLUMNS_T parameters are identifiers:

  • they are part of the SQL statement that SELECTs FROM the macro and cannot be replaced by bind variables, even with CURSOR_SHARING=FORCE;
  • their values are known at parse time, when the macro executes;

The “classic” parameters act differently depending on what the caller passes:

  • variables: values are NULL at parse time. They appear in the generated SELECT statement as bind variables.
    (The generated statement can be seen using DBMS_UTILITY.EXPAND_SQL_TEXT.)
  • literals: values are known at parse time and can be integrated directly into the SELECT statement.
    Careful, CURSOR_SHARING=FORCE turns the literals into bind variables, so never assume that “classic” parameter values are known at parse time!

At execution time, the values of the “classic” variable parameters are plugged into the bind variables of the final SELECT statement before execution.
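
The generated statement mentioned above can be inspected directly. Here is a minimal sketch using DBMS_UTILITY.EXPAND_SQL_TEXT, assuming the table T and the MULTIPLIER macro from the examples above are still in place:

declare
  l_sql clob;
begin
  -- expand the macro call into the final SELECT statement (parse-time expansion only, no execution)
  dbms_utility.expand_sql_text(
    input_sql_text  => 'select * from multiplier(t, columns(a), 9)',
    output_sql_text => l_sql
  );
  dbms_output.put_line(l_sql);
end;
/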

Conclusion: “polymorphic” vs. “dynamic”

If you ask me, a “dynamic” table macro would be like a PL/SQL function that returns a REF CURSOR: the cursor can have a different “shape” at every execution, even when the parameters are the same, so the result is as “dynamic” as we want it to be. A SQL table macro is different, since it only works at parse time. That is why the documentation calls it “polymorphic” and never “dynamic”.

SQL table macros 1: a moving target?

Testing SQL table macros in the Cloud (19c, 21c) and on LiveSQL (19.8): behaviors are different, and the differences change with time! Here I’ll zoom in on the TABLE input parameter, as of 2021-02-16.

[UPDATE 2020-12-08: 20c preview is dead, long live 21c! This post has been revised after redoing my tests on a Free Tier 21c.]

[UPDATE 2021-02-16: thanks to a test by Philipp Salvisberg, I can say that version 19.10 has the same limitations as version 19.8.]

Too long to read?

This is a long post without a direct, practical SQL solution at the end. If you just want the gist:

  • SQL table macros let you take a table, view or named subquery as input and transform it into another query at parse time.
  • SQL table macros in 19.8 and 19.10 accept only scalar data types in the input, but in the Cloud (19c, 21c) they accept practically any data type.
  • The information about the input data types comes from different sources that are incomplete and sometimes in disagreement.
  • I hope Oracle will complete the back-port of SQL table macros to 19c and that they will correct and complete the information about the input data types.

Preliminaries

A few words for those who are wondering what in the world I am talking about:

  • in classical programming languages, a “macro” is a fragment of code with a name. When you write a program, you write the name; when the program is compiled, as a first step the name is replaced with the code fragment.
  • A SQL macro is a function that returns either a SQL expression or a SELECT statement. The macro is executed only at hard parse time, and results in a final SQL statement which then goes through the normal parsing process.
  • A SQL table macro returns an entire SELECT statement. It can have a table, view or named subquery as an input parameter: this allows it to be a “polymorphic table function”.
    • “Polymorphic” means “able to change shapes”; in a SELECT statement, the “shape” comes from the number, names and types of the columns in the output.
    • The main contributor to the “shape” is the table input parameter.
    • For now, I’m going to call this type of macro a “polymorphic table macro” (PTM) to distinguish it from other types of “polymorphic table functions”.

Simple example

create or replace function simplest_ptm(p_table in DBMS_TF.TABLE_T)
  return varchar2 sql_macro is
begin
  return 'select * from p_table';
end simplest_ptm;
/
select * from simplest_ptm(dual);

DUMMY
-----
X

There are at least two strange things to notice here:

  1. When I call the function, the parameter dual is an identifier, not a literal!
    dual is just as much a part of the SELECT statement as when I refer to it directly by select * from dual.
    Since it is part of the statement, changing the parameter changes the statement itself, which means the new statement has to be parsed again. This is what allows different “shapes”: if I change the parameter I get a new statement, which has to be parsed anew and therefore can have a new “shape”.
    (As an identifier, this parameter cannot be changed dynamically using a bind variable.)
  2. The function appears to return the literal ‘select * from p_table’ – but it doesn’t!
    Even though the return string is written as a literal, the “p_table” part is replaced by the table that p_table refers to.

Analyzing the table input parameter

The DBMS_TF.TABLE_T record has a complex structure that is used by all polymorphic table functions. When used in a SQL macro, some items are left null.

TYPE TABLE_T IS RECORD(
  column                TABLE_COLUMNS_T,
  schema_name           DBMS_ID,  -- always null
  package_name          DBMS_ID,  -- always null
  ptf_name              DBMS_ID); -- always null

TYPE TABLE_COLUMNS_T IS TABLE OF COLUMN_T;

TYPE COLUMN_T IS RECORD (
  description            COLUMN_METADATA_T,
  pass_through           BOOLEAN,   -- always TRUE
  for_read               BOOLEAN);  -- always FALSE

TYPE COLUMN_METADATA_T IS RECORD (  -- THE GOOD STUFF
  type               PLS_INTEGER,
  max_len            PLS_INTEGER DEFAULT -1,
  name               VARCHAR2(32767),
  name_len           PLS_INTEGER,
  precision          PLS_INTEGER,
  scale              PLS_INTEGER,
  charsetid          PLS_INTEGER,
  charsetform        PLS_INTEGER,
  collation          PLS_INTEGER );

The bottom line is that this structure contains lots of information about the columns, but not information about the table itself: we don’t even know what the name of the table is! This makes more sense when we remember that the input parameter can be a named subquery from a WITH clause.

[UPDATE 2020-12-27: the definitions above are excerpts from the documentation. Looking at the source code of the DBMS_TF package, we see that the TABLE_T record has additional fields table_schema_name and table_name! These fields have appropriate values when the input parameter refers to a table or view; if it refers to a subquery, the values are null.]

A simple way to see what is in this complex record is the DBMS_TF.TRACE procedure.

create or replace function simplest_ptm(
  p_table in dbms_tf.table_t
) return varchar2 sql_macro is
begin
  dbms_tf.trace(p_table);
  return 'select * from p_table';
end simplest_ptm;
/
select * from simplest_ptm(dual);
/
........tab.column[1] =

pass through column

NOT a for read column
__________Type:...............VARCHAR2
__________Max Length:.........1
__________Name:..............."DUMMY"
__________Name Length:........7
__________Charset Id:.........AL32UTF8
__________Collation:..........USING_NLS_COMP

Of course, we can also use good old PL/SQL code to access all the fields directly. Using these methods, here is what I have found so far.

The Good

The input table parameter gives us lots of useful information about each column, starting with the data type. The data type is a code, and the DBMS_TF package specification tells us what each code means.

  /* Type Codes for supported types */
TYPE_VARCHAR2            CONSTANT PLS_INTEGER :=   1;
TYPE_NUMBER              CONSTANT PLS_INTEGER :=   2;
TYPE_DATE                CONSTANT PLS_INTEGER :=  12;
TYPE_RAW                 CONSTANT PLS_INTEGER :=  23;
TYPE_ROWID               CONSTANT PLS_INTEGER :=  69;
TYPE_CHAR                CONSTANT PLS_INTEGER :=  96;
TYPE_BINARY_FLOAT        CONSTANT PLS_INTEGER := 100;
TYPE_BINARY_DOUBLE       CONSTANT PLS_INTEGER := 101;
TYPE_CLOB                CONSTANT PLS_INTEGER := 112;
TYPE_BLOB                CONSTANT PLS_INTEGER := 113;
TYPE_JSON                CONSTANT PLS_INTEGER := 119; --21c only
TYPE_TIMESTAMP           CONSTANT PLS_INTEGER := 180;
TYPE_TIMESTAMP_TZ        CONSTANT PLS_INTEGER := 181;
TYPE_INTERVAL_YM         CONSTANT PLS_INTEGER := 182;
TYPE_INTERVAL_DS         CONSTANT PLS_INTEGER := 183;
TYPE_EDATE               CONSTANT PLS_INTEGER := 184;
TYPE_ETIMESTAMP_TZ       CONSTANT PLS_INTEGER := 186;
TYPE_ETIMESTAMP          CONSTANT PLS_INTEGER := 187;
TYPE_EINTERVAL_YM        CONSTANT PLS_INTEGER := 189;
TYPE_EINTERVAL_DS        CONSTANT PLS_INTEGER := 190;
TYPE_TIMESTAMP_LTZ       CONSTANT PLS_INTEGER := 231;
TYPE_ETIMESTAMP_LTZ      CONSTANT PLS_INTEGER := 232;

With this information, we can do all sorts of transformations of the query. For example, we can leave out columns of a certain data type, or format columns depending on the data type. This would be useful for formatting all the datetime data types without having to wrap them individually in TO_CHAR functions (or SQL scalar macros).
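
As an illustration, here is a minimal sketch of such a transformation: a macro that wraps DATE and TIMESTAMP columns in TO_CHAR and passes the other columns through unchanged. The function name and format mask are arbitrary, and only three of the type codes listed above are checked.

create or replace function format_dates(p_table in dbms_tf.table_t)
  return varchar2 sql_macro is
  l_cols varchar2(4000);
begin
  for i in 1..p_table.column.count loop
    l_cols := l_cols || case when l_cols is not null then ', ' end
      || case
           when p_table.column(i).description.type in
                (dbms_tf.type_date, dbms_tf.type_timestamp, dbms_tf.type_timestamp_tz)
           -- description.name is already a quoted identifier, so it can double as the alias
           then 'to_char(' || p_table.column(i).description.name
                || ', ''YYYY-MM-DD HH24:MI:SS'') '
                || p_table.column(i).description.name
           else p_table.column(i).description.name
         end;
  end loop;
  return 'select ' || l_cols || ' from p_table';
end format_dates;
/

A call such as select * from format_dates(emp) would then return HIREDATE as a formatted string (assuming the classic EMP table).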

Another good thing happens before the function even executes: Oracle checks that the parameter is a valid identifier. This is an automatic protection against SQL injection.

The Bad and the Ugly

There is a lot to say here, sadly. I’m not sure what is “bad” and what is “ugly”: let’s just say there are quite a few things I wish were different.

No invisible columns

If the input table has invisible columns, they are not included in the column list and the final SELECT statement cannot access them, even if the final SELECT statement lists them explicitly. The only way to get access to such columns is by using a view or a named subquery that lists them explicitly.

[UPDATE 2020-12-27: the DBMS_TF.TABLE_T record has additional fields table_schema_name and table_name. We can plug those values into the SELECT statement (instead of referring to the parameter itself) and then all columns are accessible.]

Version weirdness

SQL macros were supposed to appear in 20c, and have now arrived in 21c. Table macros were back-ported to version 19, starting with 19.7. Unfortunately, 19.8 (on livesql.oracle.com) – and even 19.10 – have limited functionality compared to 21c, and even compared to the Cloud Autonomous Database which is theoretically at version 19.5!

Basically, 19.8 and 19.10 allow only scalar columns in the input table, whereas 19c “Cloud” and 21c allow many more data types. Here is the list:

LONG
In-memory DATE (e.g. SYSDATE)
LONG RAW
XMLTYPE
BFILE
JSON
UDT OBJECT
UDT NESTED TABLE
UDT VARRAY
In-memory TIMESTAMP
In-memory TIMESTAMP WITH TIME ZONE
In-memory INTERVAL_YM
In-memory INTERVAL_DS
In-memory TIMESTAMP WITH LOCAL TIME ZONE

I hope the on-premises version of 19c will eventually support all the same data types as the Cloud “version”.

Inconsistent type codes

Sometimes the DBMS_TF constants define a value that is different from what the input column data actually shows.

  • CHAR / NCHAR are said to have code 96 but they return 1
  • In-memory DATE (such as SYSDATE in the SELECT list): said to be 184, is actually 13
  • In-memory TIMESTAMP WITH TIME ZONE: said to be 186, is actually 188

This is unfortunate because the constants in the DBMS_TF package are the only detailed “documentation” we have on what types are supported.

Undocumented data types

The DBMS_TF package only documents the scalar data types that work in versions up to 19.10. Even in 21c, the JSON data type is documented but not the dozen or more other data types that work fine in both 21c and 19c “Cloud”.

Type-code 58 overload

XMLTYPE columns are assigned type-code 58, which would be fine except that 58 is also the code for ANYDATA. There is no way to distinguish these types using only the information in the input TABLE_T record.

NVARACHAR2

I’m nitpicking here, but DBMS_TF.TRACE misspells NVARCHAR2 :-)

My Presentations at #UKOUG #techfest19

I just finished my second presentation at TechFest19 in sunny Brighton, England. The conference this year is great from every point of view: venue, people, content, conversations… Thanks to those who came to my talks for your benevolent attention and excellent questions/remarks!

Both my presentations are now on SlideShare. I beg you to download them so the animations work; if you don’t, you will see a jumble.

Best wishes to all for the end of the conference and the trip home.

Regards, Stew

Make your Data Dance: UNPIVOT, PIVOT and GROUP BY extensions

JSON in Oracle database 18c and 19c

MATCH_RECOGNIZE Restrictions

Sometimes we think that certain restrictions are not documented when in fact they are. Where do we forget to look?

Database Error Messages

As mentioned recently on asktom.oracle.com (thanks, Connor), all error messages and event codes are listed in $ORACLE_HOME/rdbms/mesg/oraus.msg. Searching for “pattern matching”, we read “62500 – 62549 are reserved for PATTERN MATCHING”.

So here is a link to all the MATCH_RECOGNIZE error messages: ORA-60001 to ORA-65535

As of version 19c, the range actually used is 62500 through 62521.

It was a surprise to discover that “bounded quantifiers” are limited to 200, but there it is in black and white:

“ORA-62518: MATCH_RECOGNIZE pattern is too large.”
“Permute with more than 7 elements or bounded quantifier bound greater than 200 are … currently not supported.”

I’m not suggesting we spend our lives reading database error messages, but we have to admit they are documented ;-)

Best regards,
Stew

P.S. Where are the values of UB2MAXVAL and UB4MAXVAL documented? If you know, please post a link in the comments below. I suspect UB2MAXVAL is 0XFFFF and UB4MAXVAL is 0XFFFFFFFF.

FOOTNOTE 2019-11-05 : thanks to Maxim Demenko for pointing out where UB*MAXVAL are documented. In the OCI Programmer’s Guide, at the very end of the chapter on Data Types, it says

“Throughout this guide there are references to data types like ub2 or sb4, or to constants like UB4MAXVAL. These types are defined in the oratypes.h header file, which is found in the public directory. The exact contents may vary according to the operating system that you are using.”

Backtracking

By popular request, here are my thoughts about the impact of “backtracking” on performance when using the MATCH_RECOGNIZE clause. This came up again because of a query that Jonathan Lewis wrote recently; however, I will concentrate on the theory, not the query.

Patterns

The match_recognize clause implements row pattern matching: it recognizes that a consecutive series of rows is a match for a defined pattern.

The pattern is described in the aptly named PATTERN clause. The syntax resembles a subset of the regular expression syntax. For example:

  • a regular expression pattern ‘AB’ means “find the character A immediately followed by the character B”.
  • In MATCH_RECOGNIZE, PATTERN(A B) means “find a row that meets condition A, immediately followed by a row that meets condition B”. The conditions are then described in the DEFINE clause.

Both syntaxes use two features that can lead to backtracking: quantifiers and alternation.

Quantifiers

PATTERN(A) means find exactly one row that meets condition A.

We can be more flexible in how many A rows we want:

  • A? means we want 0 or 1 A row;
  • A* means we want 0 or more rows;
  • A+ means we want 1 or more rows;
  • A{100} means we want exactly 100 rows;
  • A{3,100} means we want from 3 to 100 rows.

Notice the word “exactly” appears only once in this list. All the other quantifiers are what I’ll call indefinite: there can be more than one series of rows that match the pattern! Suppose we have 200 consecutive rows that meet condition A: the pattern A* could be met 201 different ways.

When a quantifier is indefinite, the rule is to match as many rows as possible: the quantifiers are greedy. If we add a question mark, the rule is to match as few rows as possible: the quantifier becomes reluctant.

  • A{3,100} will match 100 rows if it can.
  • A{3,100}? will match 3 rows and then stop.

Whether greedy or reluctant, indefinite quantifiers can lead to backtracking. More in a minute.
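
Here is a minimal sketch that shows the greedy / reluctant difference on its own, using a hypothetical table T(ID, AMOUNT) where many consecutive rows have a non-negative amount:

select *
from t
match_recognize(
  order by id
  measures count(*) rows_matched
  one row per match
  pattern (a{3,100})      -- greedy: matches as many A rows as it can (up to 100)
  -- pattern (a{3,100}?)  -- reluctant: stops as soon as it has 3 A rows
  define a as amount >= 0
);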

Alternation

To quote the documentation, an “alternation list is created by placing a vertical bar (|) between each regular expression. Alternatives are preferred in the order they are specified. As an example, PATTERN (A | B | C) attempts to match A first. If A is not matched, it attempts to match B. If B is not matched, it attempts to match C.”

Alternation is also indefinite: in this example, the number of rows is always the same, but the same row might meet any of three different conditions.

From now on I’ll concentrate on quantifiers, since they are much more common.

From “indefinite” to backtracking

Even though an indefinite quantifier means there is more than one correct answer, there is always one preferred answer, so what’s the big deal? Suppose there are 200 rows that meet the A condition:

  • A{1,100} returns 100 rows and
  • A{1,100}? returns 1 row.

Aha! But what if there is another condition after A?

With PATTERN(A{1,100} B), suppose there are 101 consecutive rows that meet A but not B.
A regular expression engine should find 100 A, then not find a B.
It will then backtrack, “giving back” A 100. It will then find there is no B.
It will then backtrack, “giving back” A 99. It will then find there is no B.
And so on all the way back to 1.

With PATTERN(A{1,100}? B), suppose there are 100 consecutive As followed by a B.
The engine should find 1 A, then not find a B.
It will then backtrack, adding A 2. It will then not find a B.
And so on all the way up to 100.

So backtracking does not always mean “giving back” an A: it means backing up in the pattern from B to A, whether that removes an A (greedy) or adds one (reluctant).

To summarize: backtracking can happen when an indefinite quantifier is followed by another condition. With greedy quantifiers, the worst backtracking happens when there is no match, because every possible solution must be tested. With reluctant quantifiers, backtracking may happen even when there is eventually a match.
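
To make this concrete, here is a minimal sketch (again with a hypothetical table T(ID, AMOUNT)) of the worst case for a greedy quantifier: if no row ever meets condition B, every shorter run of A rows is retried before the engine gives up.

select *
from t
match_recognize(
  order by id
  measures match_number() mn, classifier() cl
  all rows per match
  pattern (a{1,100} b)
  define
    a as amount >= 0,  -- met by long runs of consecutive rows
    b as amount < 0    -- if this never happens, expect backtracking
);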

Instrumentation?

There is one bit of instrumentation about backtracking in the explain plan. Here is a quote from Oracle development that Keith Laker sent me five years ago:

In the explain plan the specific pattern matching keywords are: MATCH RECOGNIZE (SORT | BUFFER) [DETERMINISTIC FINITE AUTO]

When the plan shows “MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO”, here “MATCH RECOGNIZE” refers to the row source for evaluating the match_recognize clause, “SORT” means the row source uses “SORT” to sort the data before running it through the state machine to find matches, and “DETERMINISTIC FINITE AUTO” means the state machine that we constructed is deterministic and thus when running the sorted rows through the state machine, we don’t do backtracking. “DETERMINISTIC FINITE AUTO” is desirable as the execution is efficient when there is no backtracking.

Currently we detect deterministic finite automaton by checking the state machine built: if any state has 2 or more outgoing transitions then we regard the state machine as non-deterministic, if any final state is followed by a non-final state, then the state machine is regarded as non-deterministic. We don’t check the predicates associated with each transition at all. At the moment we can only detect a few trivial cases such as PATTERN (A B C), PATTERN (A B+), PATTERN (A B*), etc.

For PATTERN (A | B), or PATTERN (A B+ C), we just regard the state machine as non-deterministic. We don’t check the mutual exclusiveness of the define predicates in detecting a deterministic state machine.
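
To see which case applies to a given query, we can look at the plan. A minimal sketch, with a hypothetical table T(ID, AMOUNT) and one of the trivial patterns mentioned above:

explain plan for
select *
from t
match_recognize(
  order by id
  measures classifier() cl
  all rows per match
  pattern (a b+)
  define a as amount = 0, b as amount > 0
);

select * from table(dbms_xplan.display);
-- look for MATCH RECOGNIZE SORT [DETERMINISTIC FINITE AUTO] in the Operation column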

Conclusions

The quote from Oracle development confirms that alternation, or indefinite quantifiers followed by another condition, are possibly subject to backtracking. If we are lucky enough to see DETERMINISTIC FINITE AUTO, we know backtracking is not a problem. In testing, we should always test situations where no match is found. If there are reluctant quantifiers, we should also test situations where there is a match, but not right away.

Finally, each condition should be defined as strictly as possible, saying what it should be and also what it should not be. More than once, I have run into backtracking problems because the first condition was always true; once I defined the first condition more strictly, potential matches were eliminated much earlier and the query sped up considerably.

Hope this helps,
Stew

Predict_Clustering_Factor

Jonathan Lewis recently wrote about estimating the clustering factor of an index, taking into account the intended value of the TABLE_CACHED_BLOCKS parameter of DBMS_STATS.SET_TABLE_PREFS. He included a function I wrote called predict_clustering_factor. Here is a corrected and improved version. Any questions or remarks about the code are welcome here.

Please refer to blog posts by Jonathan and Richard Foote for explanations of the clustering factor and the TABLE_CACHED_BLOCKS parameter.

UPDATE 2019-10-28: thanks to a comment by Viacheslav Andzhich on Jonathan’s post, I found that PLSQL_CODE_TYPE=’NATIVE’ makes the function run in two-thirds the time. He also mentioned the possibility of overflow on PLS_INTEGER, so I now use INTEGER only. Thanks Viacheslav!

Description

The function takes as input:

  • p_table_name: the table name
  • p_column_list: a comma-separated list of the column names to be indexed
  • p_max_table_cached_blocks: the MAXIMUM value of TABLE_CACHED_BLOCKS to be tested (255 by default).

The function then queries the ROWIDs, sorted in index order:

  • Rows are omitted if all index columns are NULL.
  • In case of ties, rows are sorted by ROWID.
  • Each ROWID has its “row number” portion set to 0, so it effectively becomes a block id.

The function collects the most recent p_max_table_cached_blocks block ids.
When there is no more room, it replaces the Least Recently Used block id with the current one.

lt_hits_per_RU is a table of intermediate results. The number of entries =  p_max_table_cached_blocks. When a new row has a block id among the most recent, I call that a “hit”. Every hit adds one to an entry in the result table: the index of the entry is based on how “recently used” the block id was. For example, if the current row and the previous row are in the same block, the index is one.

After all the rows are processed, the result table is transformed into a table of clustering factors: the clustering factor is the total number of rows minus all the hits up to and including the current entry.

In the output, the number of rows is equal to p_max_table_cached_blocks.
The ROWNUM provides the TABLE_CACHED_BLOCKS parameter value.

Sample of usage:

select rownum table_cached_blocks,
column_value clustering_factor
from table(predict_clustering_factor('T1','v1,rand'))

The code

ALTER SESSION SET PLSQL_CODE_TYPE='NATIVE';

create or replace function predict_clustering_factor(
/*
Function to predict the clustering factor of an index,
taking into account the intended value of
the TABLE_CACHED_BLOCKS parameter of DBMS_STATS.SET_TABLE_PREFS.

The function takes as input:
- p_table_name: the table name
- p_column_list: a comma-separated list of the column names to be indexed
- p_max_table_cached_blocks: the MAXIMUM value of TABLE_CACHED_BLOCKS to be tested
  (255 by default).

The function then queries the ROWIDs, sorted in index order:
- Rows are omitted if all index columns are NULL.
- In case of ties, rows are sorted by ROWID.
- Each ROWID has its "row number" portion set to 0, so it effectively becomes a block id.

The function collects the most recent p_max_table_cached_blocks block ids.
When there is no more room, it replaces the Least Recently Used block id with the current one.

The function returns a table of clustering factors.
The number of rows is equal to p_max_table_cached_blocks.
The ROWNUM of the table is the TABLE_CACHED_BLOCKS parameter value.

Sample of usage:
  select rownum table_cached_blocks,
    column_value clustering_factor
  from table(predict_clustering_factor('T1','v1,rand'))
*/
  p_table_name in varchar2,
  p_column_list in varchar2,
  p_max_table_cached_blocks in number default 255
) return sys.odcinumberlist authid current_user is

  sql_text varchar2(4000);
  rc sys_refcursor;
  type tt_rids is table of rowid;
  lt_rids tt_rids;

  type t_block is record(
    block_id rowid,
    most_recent_hit integer
  );
  type tt_blocks is table of t_block;
  lt_blocks tt_blocks := new tt_blocks();
  l_block_id rowid;
  l_block_id_prev rowid;

  l_rn integer := 0;
  b_block_found boolean;
  l_LRU integer;
  i_LRU integer := 0;

  lt_hits_per_RU sys.odcinumberlist := new sys.odcinumberlist();
  i_hits_per_RU integer;

  function truncated_rid(p_rid in rowid) return rowid is
    l_rowid_type number;
    l_object_number NUMBER;
    l_relative_fno NUMBER;
    l_block_number NUMBER;
    l_row_number NUMBER;
  begin
    DBMS_ROWID.ROWID_INFO (
      p_rid,
      l_rowid_type,
      l_object_number,
      l_relative_fno,
      l_block_number,
      l_row_number
    );
    return DBMS_ROWID.ROWID_CREATE (
      l_rowid_type,
      l_object_number,
      l_relative_fno,
      l_block_number,
      0
    );
  end truncated_rid;

  function hits_per_RU(p_most_recent_hit in integer) return integer is
    i_hits_per_RU integer := 1;
  begin
    for i in 1..lt_blocks.count loop
      if lt_blocks(i).most_recent_hit > p_most_recent_hit then
        i_hits_per_RU := i_hits_per_RU + 1;
      end if;
    end loop;
    return i_hits_per_RU;
  end hits_per_RU;

begin
  -- Check for valid TABLE_CACHED_PARAMETER value
  if p_max_table_cached_blocks != trunc(p_max_table_cached_blocks)
  or p_max_table_cached_blocks not between 1 and 255 then
    raise_application_error(
      -20001,
      'input parameter p_max_table_cached_blocks must be an integer between 1 and 255'
    );
  end if;

  -- Initialize hits_per_RU table
  lt_hits_per_RU.extend(p_max_table_cached_blocks);
  for i in 1..lt_hits_per_RU.count loop
    lt_hits_per_RU(i) := 0;
  end loop;

  -- Execute query that mimics index
  sql_text :=
    'select rowid from '||p_table_name
    ||' where '||replace(p_column_list, ',', ' is not null or ')||' is not null'
    ||' order by '||p_column_list||', rowid';
  dbms_output.put_line('Query text: '||sql_text);
  open rc for sql_text;

  loop
    fetch rc bulk collect into lt_rids limit 10000;

    for irid in 1..lt_rids.count loop

      l_rn := l_rn + 1;
      l_block_id := truncated_rid(lt_rids(irid));

      -- Optimized treatment of first row
      if l_rn = 1 then
        lt_blocks.extend;
        lt_blocks(1).block_id := l_block_id;
        lt_blocks(1).most_recent_hit := l_rn;
        l_block_id_prev := l_block_id;
        continue;
      end if;

      -- Optimized treatment of consecutive rows in same block
      if l_block_id = l_block_id_prev then
        lt_hits_per_RU(1) := lt_hits_per_RU(1) + 1;
        continue;
      end if;

      l_block_id_prev := l_block_id;
      l_LRU := l_rn;
      b_block_found := false;

      for i in 1..lt_blocks.count loop

        -- if the new block_id is never found,
        -- i_LRU will point to the Least Recently Used block
        if l_LRU > lt_blocks(i).most_recent_hit then
          l_LRU := lt_blocks(i).most_recent_hit;
          i_LRU := i;
        end if;

        -- if the new block_id is found,
        -- then how many blocks ago was it found?
        if lt_blocks(i).block_id = l_block_id then
          b_block_found := true;
          -- how recently used is the block?
          i_hits_per_RU := hits_per_RU(lt_blocks(i).most_recent_hit);
          -- update hit summary
          lt_hits_per_RU(i_hits_per_RU) := lt_hits_per_RU(i_hits_per_RU) + 1;
          -- the block_id was just hit, so update most_recent_hit value
          lt_blocks(i).most_recent_hit := l_rn;
          exit;
        end if;

      end loop;

      -- If new block_id, add to lt_blocks if there is room,
      -- otherwise overwrite Least Recently Used entry
      if not b_block_found then
        if lt_blocks.count < p_max_table_cached_blocks then
          lt_blocks.extend;
          lt_blocks(lt_blocks.count).block_id := l_block_id;
          lt_blocks(lt_blocks.count).most_recent_hit := l_rn;
        else
          lt_blocks(i_LRU).block_id := l_block_id;
          lt_blocks(i_LRU).most_recent_hit := l_rn;
        end if;
      end if;

    end loop;

    exit when rc%notfound;
  end loop;

  close rc;

-- Prepare output
  -- calculate cumulative sum of hits
  for i in 2..lt_hits_per_RU.count loop
    lt_hits_per_RU(i) := lt_hits_per_RU(i) + lt_hits_per_RU(i-1);
  end loop;
  -- subtract cumulative hits from total number of rows to get
  -- clustering factor. ROWNUM provides the TABLE_CACHED_BLOCKS value.
  for i in 1..lt_hits_per_RU.count loop
    lt_hits_per_RU(i) := l_rn - lt_hits_per_RU(i);
  end loop;

  dbms_output.put_line('Total number of rows in index = '|| l_rn);
  return lt_hits_per_RU;

exception when others then
  if rc%isopen then
    close rc;
  end if;
  raise;

end predict_clustering_factor;
/

LISTAGG() as a Cumulative Function

LISTAGG() can be used as an analytic function, but cannot be cumulative. What does that mean, and is there a workaround?

I’ll demonstrate with the boring old DEPT and EMP tables, just because they have so few rows, and I’ll keep only the department and employee names:

create or replace view denames as
select dname, ename from dept join emp using(deptno);

select * from denames
order by dname, ename;
DNAME      ENAME
ACCOUNTING CLARK
ACCOUNTING KING
ACCOUNTING MILLER
RESEARCH   ADAMS
RESEARCH   FORD
RESEARCH   JONES
RESEARCH   SCOTT
RESEARCH   SMITH
SALES      ALLEN
SALES      BLAKE
SALES      JAMES
SALES      MARTIN
SALES      TURNER
SALES      WARD

 

Reminder: aggregate versus analytic

“Aggregate functions return a single result row based on groups of rows, rather than on single rows.” (19c doc)

select dname,
  listagg(ename, ',') within group(order by ename) enames
from denames
group by dname
order by dname;
DNAME      ENAMES
ACCOUNTING CLARK,KING,MILLER
RESEARCH   ADAMS,FORD,JONES,SCOTT,SMITH
SALES      ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD

 

By contrast, “analytic functions…differ from aggregate functions in that they return multiple rows for each group…All joins and all WHERE, GROUP BY, and HAVING clauses are completed before the analytic functions are processed.” (19c). In other words, the output rows are already determined and any analytic function simply adds a column to each existing row. The “analytic clause”, starting with the keyword OVER(), says what rows provide data for the function.

select de.*,
  listagg(ename, ',') within group(order by ename)
    over(partition by dname)
  enames
from denames de;
DNAME      ENAME  ENAMES
ACCOUNTING CLARK  CLARK,KING,MILLER
ACCOUNTING KING   CLARK,KING,MILLER
ACCOUNTING MILLER CLARK,KING,MILLER
RESEARCH   ADAMS  ADAMS,FORD,JONES,SCOTT,SMITH
RESEARCH   FORD   ADAMS,FORD,JONES,SCOTT,SMITH
RESEARCH   JONES  ADAMS,FORD,JONES,SCOTT,SMITH
RESEARCH   SCOTT  ADAMS,FORD,JONES,SCOTT,SMITH
RESEARCH   SMITH  ADAMS,FORD,JONES,SCOTT,SMITH
SALES      ALLEN  ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD
SALES      BLAKE  ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD
SALES      JAMES  ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD
SALES      MARTIN ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD
SALES      TURNER ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD
SALES      WARD   ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD

 

LISTAGG() Limitations

The “analytic clause” has three parts:

  1. PARTITION BY: like GROUP BY, separates the data into chunks based on one or more values;
  2. ORDER BY: within the “partition”, orders the rows and sets up the next part;
  3. Windowing clause: says what rows within the ordered partition are concerned by the function.

Some analytic functions allow only the first or second parts, and LISTAGG() is one of them. Let’s try to use it as a cumulative function:

select de.*,
  listagg(ename, ',') within group(order by ename)
    over(partition by dname order by ename)
  enames
from denames de;

SQL Error: ORA-30487: ORDER BY not allowed here
30487. 00000 -  "ORDER BY not allowed here"
*Cause:    DISTINCT functions and RATIO_TO_REPORT cannot have an ORDER BY

The error message is a bit strange, but the conclusion is: it doesn’t work.
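
For comparison, an analytic function that accepts the full analytic clause has no problem being cumulative. A minimal sketch with the same DENAMES view, using COUNT instead of LISTAGG:

select de.*,
  count(*) over(partition by dname order by ename) running_count
from denames de;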

The workaround

The idea is simple: take what LISTAGG() gives us and do a SUBSTR() based on the row number:

select dname, ename,
  substr(enames, 1, instr(enames, ',', 1, rn) - 1) enames
from (
  select de.*,
    listagg(ename, ',') within group(order by ename)
      over(partition by dname)
    || ','
    enames,
    row_number() over(partition by dname order by ename) rn
  from denames de
);
DNAME      ENAME  ENAMES
ACCOUNTING CLARK  CLARK
ACCOUNTING KING   CLARK,KING
ACCOUNTING MILLER CLARK,KING,MILLER
RESEARCH   ADAMS  ADAMS
RESEARCH   FORD   ADAMS,FORD
RESEARCH   JONES  ADAMS,FORD,JONES
RESEARCH   SCOTT  ADAMS,FORD,JONES,SCOTT
RESEARCH   SMITH  ADAMS,FORD,JONES,SCOTT,SMITH
SALES      ALLEN  ALLEN
SALES      BLAKE  ALLEN,BLAKE
SALES      JAMES  ALLEN,BLAKE,JAMES
SALES      MARTIN ALLEN,BLAKE,JAMES,MARTIN
SALES      TURNER ALLEN,BLAKE,JAMES,MARTIN,TURNER
SALES      WARD   ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD

 

Caveats

This solution seems fine, as long as the source values are NOT NULL. If there is a NULL value, the “cumulative value” becomes NULL. Look at the last row of the output.

update emp set ename = null where ename = 'ALLEN';

(repeat the query)
DNAME      ENAME  ENAMES
ACCOUNTING CLARK  CLARK
ACCOUNTING KING   CLARK,KING
ACCOUNTING MILLER CLARK,KING,MILLER
RESEARCH   ADAMS  ADAMS
RESEARCH   FORD   ADAMS,FORD
RESEARCH   JONES  ADAMS,FORD,JONES
RESEARCH   SCOTT  ADAMS,FORD,JONES,SCOTT
RESEARCH   SMITH  ADAMS,FORD,JONES,SCOTT,SMITH
SALES      BLAKE  BLAKE
SALES      JAMES  BLAKE,JAMES
SALES      MARTIN BLAKE,JAMES,MARTIN
SALES      TURNER BLAKE,JAMES,MARTIN,TURNER
SALES      WARD   BLAKE,JAMES,MARTIN,TURNER,WARD
SALES

 

Also, with version 19c LISTAGG() allows the DISTINCT keyword. If you want to use DISTINCT, then instead of ROW_NUMBER() use DENSE_RANK()!

update emp set ename = 'MARTIN' where ename is null;

select dname, ename,
  substr(enames, 1, instr(enames, ',', 1, rn) - 1) enames
from (
  select de.*,
    listagg(DISTINCT ename, ',') within group(order by ename)
      over(partition by dname)
    || ','
    enames,
    DENSE_RANK() over(partition by dname order by ename) rn
  from denames de
);
DNAME      ENAME  ENAMES
ACCOUNTING CLARK  CLARK
ACCOUNTING KING   CLARK,KING
ACCOUNTING MILLER CLARK,KING,MILLER
RESEARCH   ADAMS  ADAMS
RESEARCH   FORD   ADAMS,FORD
RESEARCH   JONES  ADAMS,FORD,JONES
RESEARCH   SCOTT  ADAMS,FORD,JONES,SCOTT
RESEARCH   SMITH  ADAMS,FORD,JONES,SCOTT,SMITH
SALES      BLAKE  BLAKE
SALES      JAMES  BLAKE,JAMES
SALES      MARTIN BLAKE,JAMES,MARTIN
SALES      MARTIN BLAKE,JAMES,MARTIN
SALES      TURNER BLAKE,JAMES,MARTIN,TURNER
SALES      WARD   BLAKE,JAMES,MARTIN,TURNER,WARD

 
With these precautions, why not use LISTAGG() as a cumulative function if you want? Hope this helps…

Making Longer Lists

For very long lists, we need the return values to be CLOBs. Alas, LISTAGG can only return VARCHAR2 lists, but XML and JSON aggregate functions can return CLOBs!

As an exercise, I’ll start from ALL_OBJECTS: I want to make a list of all fully qualified object names for each object type. To start with, a simple subquery:

with data as (
  select object_type, owner ||'.'|| object_name obj
  from all_objects
)
select * from data;

OBJECT_TYPE  OBJ
INDEX	     SYS.I_FILE#_BLOCK#
INDEX	     SYS.I_OBJ3
INDEX	     SYS.I_TS1
INDEX	     SYS.I_CON1
...

XML

XML functions work on the XMLTYPE data type, which has practically no length limitations. To make my list, I have to go through three steps:

  1. Make an XML fragment out of each object.
    This step will escape any special characters such as '<' or '>'.
  2. Aggregate the XML fragments.
  3. Extract the text from the aggregated XML data.
    This step will unescape the special characters.

I have to handle myself the comma that separates the elements of the list.

with data as (
  select object_type, owner ||'.'|| object_name obj
  from all_objects
)
select object_type,
SUBSTR(
  xmlcast(
    xmlagg(
      xmlelement(O, ',' || obj)
      order by obj
    ) AS CLOB
  ), 2
) qualified_objects
from data
group by object_type
order by object_type;

On my system with about 78000 objects, this takes 3 seconds and a bit.

JSON

JSON support in the Oracle database is pretty new. For full support of CLOBs, we need versions 18c or later (or a patched version 12.2, but I don’t know the precise version number).

The steps are a little different with JSON:

  1. I can aggregate the data directly into a JSON array.
    This escapes any special characters like double quotes, newlines, etc.
  2. To turn the multiple array items into one (which is my list), I replace "," (which separates the array items) with my separator.
  3. I then use JSON_VALUE to extract the list from the array.
    This unescapes any character that needs it.

with data as (
  select object_type, owner ||'.'|| object_name obj
  from all_objects
)
select object_type,
  json_value(
    replace(
      json_arrayagg(obj order by obj returning clob),
      '","',
      ','
    ),
    '$[0]' returning clob
  ) qualified_objects
from data
group by object_type
order by object_type;

Instead of 3+ seconds, this code takes a bit over 1 second, so three times faster.

As a reminder, JSON is great for going the other way: split the items of a list into a table. See Splitting Strings: a New Champion!

Interval expressions

I just learned about these critters thanks to an ODC forum question: how to calculate the difference between two dates in calendar years and months, days, hours, minutes and seconds.

What are they, and why are they?

Take any two datetime values (DATE, TIMESTAMP, TIMESTAMP WITH (LOCAL) TIME ZONE) and subtract one from the other: what do you get?

  • With two dates, you get the number of days.
  • With two timestamps, you get an INTERVAL DAY TO SECOND.

If you always want the interval, just put the two values in parentheses and add DAY TO SECOND. You can add YEAR TO MONTH instead to get an INTERVAL YEAR TO MONTH.

All the following combinations are OK:

select (sysdate - sysdate) day to second from dual;
select (sysdate - sysdate) year to month from dual;
select (sysdate - systimestamp) day to second from dual;
select (sysdate - systimestamp) year to month from dual;
select (systimestamp - sysdate) day to second from dual;
select (systimestamp - sysdate) year to month from dual;
select (systimestamp - systimestamp) day to second from dual;
select (systimestamp - systimestamp) year to month from dual;

Interval expressions are a handy way to get intervals (especially year-month intervals) from all datetime datatypes with the same code.

INTERVAL YEAR TO MONTH

INTERVAL YEAR TO MONTH is based on calendar years and months, so the “same” interval may cover a different number of days depending on the context. For example, adding INTERVAL '1' MONTH  to January 1 will add 31 days, but adding it to February 1, 2019 will add 28 days. In fact, sometimes it won’t add any days at all! Tell Oracle to add 1 month to January 31 and it will just say February 31 doesn’t exist. (Actually the error message is not quite that specific…).
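
For example, this statement raises ORA-01839 (the date and month are chosen only for illustration):

select date '2021-01-31' + interval '1' month from dual;
-- ORA-01839: date not valid for month specified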

The other surprise is rounding: since both year and month portions are integers, the month portion is rounded according to a rule I don’t fully understand:

  • (DATE '2016-02-29' - DATE '2016-02-15') year to month is 0 months, whereas
  • (DATE '2016-03-01' - DATE '2016-02-16') year to month is 1 month.

Keeping all this in mind, we can still use intervals and interval expressions to calculate the difference between two datetime values in calendar years and months, days, hours, minutes and seconds.

The answer, step by step

To split things up into smaller portions that are easier to understand, I’ll start with a view called QUESTION. We can change this view whenever we want different test data. Here I’ll test all possible date-only ranges (all time portions are 00:00:00) in the years 2015 through 2017.

create or replace view question as  
with dates(dte) as (  
  select date '2014-12-31' + level  
  from dual  
  connect by level <= 365+366+365  
)  
, ranges(date_from, date_to) as (  
  select a.dte, b.dte  
  from dates a, dates b  
  where a.dte < b.dte  
)  
select * from ranges;

My first draft of an answer is straightforward:

  • The year-month interval goes from the first day of the “from” month to the first day of the “to” month.
  • The day-second interval =
    • the day-second interval from date_from to date_to
    • minus the day-second interval from the first day of the “from” month to the first day of the “to” month.
create or replace view answer as  
select date_from, date_to,  
  (trunc(date_to,'mm') - trunc(date_from,'mm')) year(5) to month as iym,  
  (date_to - date_from) day(7) to second -  
    (trunc(date_to,'mm') - trunc(date_from,'mm')) day(7) to second  
  as ids  
from question;

This solution lets us do “round trips”: if we have date_from, iym and ids, we can calculate date_to, and if we have date_to, iym and ids we can calculate date_from.

select a.*,  
  case when ids < interval '0' second  
    then date_from + ids + iym  
    else date_from + iym + ids  
  end new_to,  
  case when ids < interval '0' second  
    then date_to - iym - ids  
    else date_to - ids - iym  
  end new_from  
from answer a;

The only problem is that IDS, which is the day-second interval, is negative almost half the time. We can use that fact to calculate properly, but users will find it strange.

In order to satisfy the users, here is a second draft:

  • When IDS is negative
    • subtract 1 from IYM
    • add to IDS the number of days in the “from” month

If we want to avoid ambiguous results, we still have to say when the original IDS was negative.
To simplify “round trips”, I decided to add an IDS_ADJUSTMENT column with the number of days in the “from” month.

create or replace view adjusted_answer as  
with details as (  
  select date_from, date_to,  
    (trunc(date_to,'mm') - trunc(date_from,'mm')) year(5) to month  
    as iym,  
    (date_to - date_from) day(7) to second -  
      (trunc(date_to,'mm') - trunc(date_from,'mm')) day(7) to second  
    as ids,  
    ((trunc(date_from,'mm') + interval '1' month) - trunc(date_from,'mm')) day to second  
    as from_month_days  
  from question  
)  
select date_from, date_to,  
  case when ids < interval '0' second  
    then iym - interval '1' month -- subtract 1 month  
    else iym  
  end iym,  
  case when ids < interval '0' second  
    then ids + from_month_days --add number of days in "from" month  
    else ids  
  end ids,  
  case when ids < interval '0' second  
    then from_month_days  
    else interval '0' second  
  end ids_adjustment    
from details;

This gives plausible results to the users. The same DATE_FROM, IYM and IDS combination will sometimes occur for two different DATE_TO values, but the IDS_ADJUSTMENT column will allow us to do “round trips”.

select a.*,  
  case when ids_adjustment > interval '0' second  
    then date_from  
      + (ids - ids_adjustment)  
      + (iym + interval '1' month)  
    else date_from + iym + ids  
  end new_to,  
  case when ids_adjustment > interval '0' second  
    then date_to  
      - (iym + interval '1' month)  
      - (ids - ids_adjustment)  
    else date_to - ids - iym  
  end new_from      
from adjusted_answer a;

Again, almost half the ranges will have a positive IDS_ADJUSTMENT value. We can eliminate that when the “from” month and the month preceding date_to have the same number of days: see the before_to_month_days expression and the final CASE expression in the view below.

create or replace view final_answer as  
with details as (  
  select date_from, date_to,  
    (trunc(date_to,'mm') - trunc(date_from,'mm')) year(5) to month  
    as iym,  
    (date_to - date_from) day(7) to second -  
      (trunc(date_to,'mm') - trunc(date_from,'mm')) day(7) to second  
    as ids,  
    ((trunc(date_from,'mm') + interval '1' month) - trunc(date_from,'mm')) day to second  
    as from_month_days,  
    (trunc(date_to,'mm') - (trunc(date_to,'mm') - interval '1' month)) day to second  
    as before_to_month_days  
  from question  
)  
select date_from, date_to,  
  case when ids < interval '0' second  
    then iym - interval '1' month -- subtract 1 month  
    else iym  
  end iym,  
  case when ids < interval '0' second  
    then ids + from_month_days --add number of days in "from" month  
    else ids  
  end ids,  
  case when ids < interval '0' second and from_month_days != before_to_month_days  
    then from_month_days  
    else interval '0' second  
  end ids_adjustment    
from details;

Now IDS_ADJUSTMENT will only be positive 25% of the time (with this test data!) We can do “round trips” the same way as with the “adjusted answer”.

As a bonus, thanks to “interval expressions” we can change the data types in the QUESTION view and the solution will still work with no changes to the code.

Conclusion

We can calculate a sensible difference (including years and months) between two dates, but unless we provide additional information we cannot calculate one date using the other date and the difference.

Making Lists

Strings that contain multiple values violate first normal form, which mandates “one value per column per row”.  People still ask how to make or massage lists of values. This post is about aggregating values into lists.

Rewrite 2019-07-11: I am now using DNAME instead of DEPTNO so that the JOIN makes sense. Also Tony Hasler pointed out that JOIN before MODEL does not need an inline view.

(To split lists into little tables, see my series on “Splitting strings”.)

Chunks of lists

(For a discussion of this topic that has influenced my solutions, see Managing overflows in LISTAGG by Keith Laker, with solutions provided by Lei Sheng.)

Version 11.2 introduced LISTAGG, which concatenates values with or without a delimiter. For example:

select dname,
  listagg(ename, ',') within group(order by ename) enames
from dept join emp using(deptno)
group by dname
order by dname;
DNAME      ENAMES
ACCOUNTING CLARK,KING,MILLER
RESEARCH   ADAMS,FORD,JONES,SCOTT,SMITH
SALES      ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD

 
Unfortunately, the return value cannot be a CLOB; if it is too long, we get an ORA-01489 exception. Version 12.2 provides the ON OVERFLOW clause that truncates the return value, so no error but we don’t get all the values back.

Even before version 12.2, we could get all the values if we divided the list into chunks with a valid maximum length.

Starting with 12.1 and using the MATCH_RECOGNIZE clause:

select dname, chunk,
  listagg(ename, ',') within group(order by ename) enames
from (
  select dname, ename from dept join emp using(deptno)
)
match_recognize(
  partition by dname order by ename
  measures match_number() chunk
  all rows per match
  pattern(s b*)
  define b as sum(lengthb(ename)) + count(b.*) <= 20
)
group by dname, chunk
order by dname, chunk;
DNAME      CHUNK ENAMES
ACCOUNTING 1     CLARK,KING,MILLER
RESEARCH   1     ADAMS,FORD,JONES
RESEARCH   2     SCOTT,SMITH
SALES      1     ALLEN,BLAKE,JAMES
SALES      2     MARTIN,TURNER,WARD

 

Notice I am using LENGTHB to calculate the length in bytes of the input strings. COUNT(B.*) adds the lengths of the delimiters, assuming that the delimiter is one byte long. Lei uses LENGTHB even on the delimiter.

Before 12.1 we can get there with the MODEL clause:

select dname, chunk,
  listagg(ename, ',') within group(order by ename) enames
from (
  select * from dept join emp using(deptno)
  model
    partition by (dname)
    dimension by (row_number() over(partition by dname order by ename) rn)
    measures(ename, lengthb(ename) length_so_far, 1 chunk)
    rules(
      length_so_far[rn>1] = lengthb(ename[cv()]) +
        case
          when lengthb(ename[cv()]) + length_so_far[cv()-1] + 1 <= 20
          then length_so_far[cv()-1] + 1
          else 0
        end,
      chunk[rn>1] = chunk[cv()-1] +
        case
          when length_so_far[cv()] = lengthb(ename[cv()])
          then 1
          else 0
        end    
    )
)
group by dname, chunk
order by dname, chunk;

  • The first RULE calculates the running length of the list that would be made, starting over once the list gets past the limit of 20 bytes.
  • The second RULE increases the CHUNK number whenever the running length starts over.
  • Finally, LISTAGG groups by DNAME and CHUNK.

But I want just one long string!

In that case, you need a CLOB. For that, you don’t even need version 11.2, just version 11.1.

select dname,
SUBSTR(
  xmlcast(
    xmlagg(
      xmlelement(E, ',' || ename)
      order by ename
    ) AS CLOB  -- AS VARCHAR2(4000) would be the equivalent of LISTAGG
  ), 2
) enames
from dept join emp using(deptno)
group by dname
order by dname;