Making Lists

Strings that contain multiple values violate first normal form, which mandates “one value per column per row”.  People still ask how to make or massage lists of values. This post is about aggregating values into lists.

Rewrite 2019-07-11: I am now using DNAME instead of DEPTNO so that the JOIN makes sense. Also Tony Hasler pointed out that JOIN before MODEL does not need an inline view.

(To split lists into little tables, see my series on “Splitting strings”.)

Chunks of lists

(For a discussion of this topic that has influenced my solutions, see Managing overflows in LISTAGG by Keith Laker, with solutions provided by Lei Sheng.)

Version 11.2 introduced LISTAGG, which concatenates values with or without a delimiter. For example:

select dname,
  listagg(ename, ',') within group(order by ename) enames
from dept join emp using(deptno)
group by dname
order by dname;
DNAME       ENAMES
ACCOUNTING  CLARK,KING,MILLER
RESEARCH    ADAMS,FORD,JONES,SCOTT,SMITH
SALES       ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD

Unfortunately, the return value cannot be a CLOB; if it is too long, we get an ORA-01489 exception. Version 12.2 provides the ON OVERFLOW clause, which truncates the return value: there is no error, but we don’t get all the values back.
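
For reference, the 12.2 syntax looks something like this (the truncation indicator and WITH COUNT are optional):

select dname,
  listagg(ename, ',' on overflow truncate '...' with count)
    within group(order by ename) enames
from dept join emp using(deptno)
group by dname
order by dname;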

Even before version 12.2, we could get all the values if we divided the list into chunks with a valid maximum length.

Starting with version 12.1, we can use the MATCH_RECOGNIZE clause:

select dname, chunk,
  listagg(ename, ',') within group(order by ename) enames
from (
  select dname, ename from dept join emp using(deptno)
)
match_recognize(
  partition by dname order by ename
  measures match_number() chunk
  all rows per match
  pattern(s b*)
  define b as sum(lengthb(ename)) + count(b.*) <= 20
)
group by dname, chunk
order by dname, chunk;
DNAME       CHUNK  ENAMES
ACCOUNTING      1  CLARK,KING,MILLER
RESEARCH        1  ADAMS,FORD,JONES
RESEARCH        2  SCOTT,SMITH
SALES           1  ALLEN,BLAKE,JAMES
SALES           2  MARTIN,TURNER,WARD

Notice I am using LENGTHB to calculate the length in bytes of the input strings. COUNT(B.*) adds the lengths of the delimiters, assuming that the delimiter is one byte long. Lei uses LENGTHB even on the delimiter.
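
If the delimiter were longer than one byte, the DEFINE condition could charge its byte length explicitly; a hypothetical variant would be:

define b as sum(lengthb(ename)) + count(b.*) * lengthb(',') <= 20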

Before 12.1 we can get there with the MODEL clause:

select dname, chunk,
  listagg(ename, ',') within group(order by ename) enames
from (
  select * from dept join emp using(deptno)
  model
    partition by (dname)
    dimension by (row_number() over(partition by dname order by ename) rn)
    measures(ename, lengthb(ename) length_so_far, 1 chunk)
    rules(
      length_so_far[rn>1] = lengthb(ename[cv()]) +
        case
          when lengthb(ename[cv()]) + length_so_far[cv()-1] + 1 <= 20
          then length_so_far[cv()-1] + 1
          else 0
        end,
      chunk[rn>1] = chunk[cv()-1] +
        case
          when length_so_far[cv()] = lengthb(ename[cv()])
          then 1
          else 0
        end    
    )
)
group by dname, chunk
order by dname, chunk;
  • The first rule calculates LENGTH_SO_FAR, the running length of the list that would be made, starting over once the list gets past the limit of 20 bytes.
  • The second rule increases the CHUNK number whenever the running length starts over.
  • Finally, LISTAGG groups by DNAME and CHUNK.

But I want just one long string!

In that case, you need a CLOB. For that, you don’t even need version 11.2, just version 11.1.

select dname,
SUBSTR(
  xmlcast(
    xmlagg(
      xmlelement(E, ',' || ename)
      order by ename
    ) AS CLOB  -- AS VARCHAR2(4000) would be the equivalent of LISTAGG
  ), 2
) enames
from dept join emp using(deptno)
group by dname
order by dname;

Splitting strings before they’re CLOBs!

After I tokenized two CLOBs in response to a forum question, Andy Sayer mentioned that they were probably files first. So I smacked my forehead and used external tables. When the light bulb went on, I saw that the comma needed to be a record delimiter, not a field separator. Easy.

create table item_ids_EXT(  
  "RN" NUMBER,  
  "ITEM_ID" VARCHAR2(16)  
)  
ORGANIZATION EXTERNAL(  
  TYPE ORACLE_LOADER  
  DEFAULT DIRECTORY EXT_DATA_DIR  
  ACCESS PARAMETERS(  
    records delimited by ','   
    BADFILE EXT_DATA_DIR:'i_item_id.txt.bad'  
    LOGFILE EXT_DATA_DIR:'i_item_id.txt.log'  
    FIELDS TERMINATED BY "§" NOTRIM  
    MISSING FIELD VALUES ARE NULL  
    REJECT ROWS WITH ALL NULL FIELDS  
    (  
      "RN" recnum,  
      "ITEM_ID" char(255)  
    )  
  )  
  location ('i_item_id.txt')  
)  
REJECT LIMIT 10;

create table item_keys_EXT(  
  "RN" NUMBER,  
  "ITEM_KEY" VARCHAR2(255)  
)  
ORGANIZATION EXTERNAL(  
  TYPE ORACLE_LOADER  
  DEFAULT DIRECTORY EXT_DATA_DIR  
  ACCESS PARAMETERS(  
    records delimited by ','   
    BADFILE EXT_DATA_DIR:'i_item_key.txt.bad'
    LOGFILE EXT_DATA_DIR:'i_item_key.txt.log'
    FIELDS TERMINATED BY "§" NOTRIM  
    MISSING FIELD VALUES ARE NULL  
    REJECT ROWS WITH ALL NULL FIELDS  
    (  
      "RN" recnum,  
      "ITEM_KEY" char(255)  
    )  
  )  
  location ('i_item_key.txt')  
)  
REJECT LIMIT 10;  

set timing on

create table u as  
select item_id, item_key  
from item_ids_ext  
join item_keys_ext using(rn);

Table U created.

Elapsed: 00:00:00.787

select floor(2 * 60 * 60 / 0.787) times_faster from dual;

TIMES_FASTER
------------
        9148

Splitting strings when they are CLOBs

Like many of you, I have enjoyed using SQL to make data processing go hundreds of times faster. My new record is: 8000 times faster :-)

In a recent ODC thread, the question was how to split out comma separated values from two CLOBs, one containing “item ids” and the other “item keys”. The ids and the keys had to be matched by position: the first id with the first key, the second id with the second key, and so on. Then each id/key pair was to be inserted as a row into a target table.

With 50,000 ids of 7 characters and 50,000 keys of 172 characters, the current processing takes 2 hours.

Test data

First I’ll create a table with the two CLOBs.

--drop table t purge;
create table t as
select 
SUBSTR(
  xmlcast(
    xmlagg(
      xmlelement(E, ',' || to_char(level,'fm0000009'))
      order by level
    ) AS CLOB
  ), 2
) i_item_id,
SUBSTR(
  xmlcast(
    xmlagg(
      xmlelement(E, ',' || rpad(to_char(level,'fm0000009'),172,'*'))
      order by level
    ) AS CLOB
  ), 2
) i_item_key
from dual
connect by level <= 50000;
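
As a quick sanity check, the ids CLOB should contain 50000*8 - 1 = 399,999 characters and the keys CLOB 50000*173 - 1 = 8,649,999 characters:

select dbms_lob.getlength(i_item_id) id_len,
  dbms_lob.getlength(i_item_key) key_len
from t;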

Splitting CLOB strings

Instead of seeking a “perfect” custom-made solution, I just put together two functions I have already written about:

  1. The PIPE_CLOB function: a pipelined table function that reads a CLOB and outputs up to 4000 bytes at a time, respecting a delimiter (a simplified sketch appears below).
  2. The STRING_TOKENIZE function: a pipelined table function that splits a VARCHAR2 string according to a delimiter.

All I have to do is feed the output from the first function into the second, and voilà!

select b.column_value item_id
from t,
table(pipe_clob( i_item_id, 4000, ',')) a,
table(string_tokenize(a.column_value,',')) b;
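
The real PIPE_CLOB is in the earlier post I mentioned; as a rough idea of the approach, here is a simplified sketch that assumes single-byte character data and a one-character delimiter (the real function is more careful about byte lengths):

create or replace function pipe_clob (
  p_clob in clob,
  p_max_len in integer default 4000,
  p_delim in varchar2 default ','
) return sys.odcivarchar2list pipelined
as
  -- simplified sketch, not the original PIPE_CLOB:
  -- assumes single-byte character data and a one-character delimiter
  l_len integer := dbms_lob.getlength(p_clob);
  l_pos integer := 1;          -- current read position in the CLOB
  l_buf varchar2(32767);
  l_cut integer;
begin
  while l_pos <= l_len loop
    l_buf := dbms_lob.substr(p_clob, p_max_len, l_pos);
    if l_pos + p_max_len > l_len then
      pipe row(l_buf);                        -- last piece: output whatever is left
      l_pos := l_pos + length(l_buf);
    else
      l_cut := instr(l_buf, p_delim, -1);     -- last delimiter within this piece
      if l_cut = 0 then
        raise_application_error(-20001, 'Value longer than '||p_max_len||' bytes');
      end if;
      pipe row(substr(l_buf, 1, l_cut - 1));  -- piece without the trailing delimiter
      l_pos := l_pos + l_cut;                 -- resume just after that delimiter
    end if;
  end loop;
  return;
end pipe_clob;
/

Each piece ends just before a delimiter, so STRING_TOKENIZE never sees a value split across two pieces.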

Combining the two CLOBs

Remember, the requirement was to combine the values from both CLOBs by position, in order to insert them into a target table. Since the functions return the values in order, I just use ROWNUM to number the values and use that number to join them.

--drop table u purge;
create table u as
with item_ids as (
  select rownum rn,
    b.column_value item_id
    from t,
  table(pipe_clob( i_item_id, 4000, ',')) a,
  table(string_tokenize(a.column_value,',')) b
)
, item_keys as (
  select rownum rn,
    b.column_value item_key
    from t,
  table(pipe_clob( i_item_key, 4000, ',')) a,
  table(string_tokenize(a.column_value,',')) b
)
select item_id, item_key 
from item_ids join item_keys using(rn);

Table U created.

Elapsed: 00:00:00.879

That’s right, processing time went from two hours to less than one second.

Now, I suspect that the questioner’s code is doing more than just a simple insert, but until he or she provides more details about the requirement, I get to maintain my claim:

8000 times faster!

(which I hope applies only to my code, not to the speed at which time flies…)

Splitting Strings: a New Champion!

My last post on splitting strings conceded that “a combined SQL + PL/SQL solution beats the best pure SQL solution.” Wrong! I didn’t try JSON_TABLE.

Pretend it’s an array

It’s not hard to reformat a string like

'a,b,c,d,e,"'

to be a JSON array:

select replace(json_array('a,b,c,d,e,"'), ',', '","') jstr
from dual;

JSTR
--------------------------
["a","b","c","d","e","\""]

Notice that the double quote in the data itself is properly escaped by the call to JSON_ARRAY. Then I turn the string into a series of array elements by putting double quotes around each comma. Once all that is done, JSON_TABLE will split it like this:

select subs from json_table(
  replace(json_array('a,b,c,d,e,"'), ',', '","'),
  '$[*]' columns (
    subs varchar2(4000) path '$'
  )
);

SUBS
----
a
b
c
d
e
"

So the escaped data is automatically unescaped. Now for speed, here is my test harness:

[Update 2022-12-27: the code for the test harness only worked if you first ran the code from a previous post. It should now work all by itself. 2023-01-02: the test harness now correctly puts JSON_ARRAY around the input string.]

set serveroutput on

drop table t purge;

create table t(id number, str varchar2(4000)) cache nologging;

create or replace function string_tokenize(
  p_string in varchar2,
  p_delim in varchar2
)
return sys.odcivarchar2list pipelined
as
  i_prev_pos integer := 1;
  i_pos integer;
  i_max_pos integer := length(p_string) + 1;
  i_delim_length integer := length(p_delim);
begin
  loop
    i_pos := instr(p_string, p_delim, i_prev_pos);
    if i_pos = 0 then
      i_pos := i_max_pos;
    end if;
    pipe row(substr(p_string, i_prev_pos, i_pos - i_prev_pos));
    exit when i_pos = i_max_pos;
    i_prev_pos := i_pos + i_delim_length;
  end loop;
  return;
end string_tokenize;
/

declare
  l_num number;
  l_timestamp timestamp;
  l_plsql_secs number;
  l_JSON_secs number;
  l_num_substrings number := 10;
  procedure do(p_sql in varchar2) is
  begin
    dbms_output.put_line(p_sql);
    execute immediate p_sql;
  end do;
begin
  insert into t
  select level, to_char(level,'fm000000000')||',000000002,000000003,000000004,000000005,000000006,000000007,000000008,000000009,000000010'
  from dual
  connect by level <= 10000;
  commit;
  dbms_output.put_line('Substrings' || chr(9) || 'JSON' || chr(9) || 'PL/SQL');
  for i in 1..10 loop
    select count(*) into l_num from t;
      
    l_timestamp := localtimestamp;
    select count(column_value) into l_num from (
      select id, column_value from t, table(string_tokenize(str, ','))
    );
    l_plsql_secs := extract(second from localtimestamp - l_timestamp);
      
    l_timestamp := localtimestamp;
    select count(subs) into l_num from (
      select id, subs from t, json_table(
        replace(json_array(str), ',', '","'), '$[*]' columns(
          subs varchar2(99) path '$'
        )
      )
    );
    l_JSON_secs := extract(second from localtimestamp - l_timestamp);
    dbms_output.put_line(l_num_substrings || chr(9) || l_JSON_secs || chr(9) || l_plsql_secs);
    update t set str =
    str || ',000000001,000000002,000000003,000000004,000000005,000000006,000000007,000000008,000000009,000000010';
    l_num_substrings := l_num_substrings + 10;
    commit;
  end loop;
end;
/

Notice that I keep the same number of input rows here and increase the number of substrings per row.

PL/SQL does great, but JSON_TABLE appears to beat it.

Splitting Strings: PL/SQL

Marc Bleron and Todd Hershiser gave me some very valuable feedback through their comments on my recent “Splitting Strings” posts. The big news is: PL/SQL beats SQL!

fn:tokenize

XQuery provides some built-in functions prefixed by “fn”. fn:tokenize is equivalent to ora:tokenize except that it doesn’t break when the delimiter is absent from the input string: it just returns the string. Marc says fn:tokenize is supported, and it does work in 12.1.0.2, but the official documentation says “not supported”. I have asked Marc for more information.

This is not a big deal. With fn:tokenize, there would simply be a bit less typing.
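
Based on Marc’s comment, a quick test would look something like this; with no comma in the input, it should just return one row containing ‘A’ (I haven’t benchmarked or verified this, so treat it as a sketch):

select subs from xmltable(
  'fn:tokenize($X, ",")'
  passing 'A' as X
  columns subs varchar2(4000) path '.'
);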

Escaping the delimiter

Todd pointed out that tokenize splits a string based on a regular expression, and some common delimiters (like ‘|’) have special meaning in regular expressions. As a result, I now recommend simply escaping the delimiter with a backslash ‘\’.
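
For example, with ‘|’ as the delimiter, the “tokenize” query would become something like:

select subs from xmltable(
  'if (contains($X,"|")) then ora:tokenize($X,"\|") else $X'
  passing 'A|BB|CCC' as X
  columns subs varchar2(4000) path '.'
);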

PL/SQL

Todd provided a string splitting function in PL/SQL and claimed it is clearly faster than ora:tokenize. He is right!

I wrote a function similar to his and compared it to the “tokenize” solution. Here is the function:

create or replace function string_tokenize(
  p_string in varchar2,
  p_delim in varchar2
)
return sys.odcivarchar2list pipelined
as
  i_prev_pos integer := 1;
  i_pos integer;
  i_max_pos integer := length(p_string) + 1;
  i_delim_length integer := length(p_delim);
begin
  loop
    i_pos := instr(p_string, p_delim, i_prev_pos);
    if i_pos = 0 then
      i_pos := i_max_pos;
    end if;
    pipe row(substr(p_string, i_prev_pos, i_pos - i_prev_pos));
    exit when i_pos = i_max_pos;
    i_prev_pos := i_pos + i_delim_length;
  end loop;
  return;
end string_tokenize;
/

By the way, I tested this function with and without the PRAGMA UDF clause introduced in 12c. I found no difference in performance in this case. Here is my final test harness:

set serveroutput on
declare
  l_num number;
  l_timestamp timestamp;
  l_plsql_secs number;
  l_tokenize_secs number;
  l_num_substrings number := 10;
  procedure do(p_sql in varchar2) is
  begin
    dbms_output.put_line(p_sql);
    execute immediate p_sql;
  end do;
begin
  select count(*) into l_num from user_tables where table_name = 'T';
  if l_num > 0 then
    do('drop table t purge');
  end if;
  do('create table t(id number, str varchar2(4000)) cache');
  insert into t
  select level, to_char(level,'fm000000000')||',000000002,000000003,000000004,000000005,000000006,000000007,000000008,000000009,000000010'
  from dual
  connect by level <= 10000;
  commit;
  dbms_output.put_line('Substrings' || chr(9) || 'tokenize' || chr(9) || 'PL/SQL');
  for i in 1..10 loop
    select count(*) into l_num from t;
    
    l_timestamp := localtimestamp;
    select count(column_value) into l_num from (
      select id, column_value from t, table(string_tokenize(str, ','))
    );
    l_plsql_secs := extract(second from localtimestamp - l_timestamp);
    
    l_timestamp := localtimestamp;
    select count(subs) into l_num from (
      select id, subs from t, xmltable(
        'if (contains($X,",")) then ora:tokenize($X,"\,") else $X' 
        passing str as X columns subs varchar2(4000) path '.')
    );
    l_tokenize_secs := extract(second from localtimestamp - l_timestamp);
    dbms_output.put_line(l_num_substrings || chr(9) || l_tokenize_secs || chr(9) || l_plsql_secs);
    update t set str =
    str || ',000000001,000000002,000000003,000000004,000000005,000000006,000000007,000000008,000000009,000000010';
    l_num_substrings := l_num_substrings + 10;
    commit;
  end loop;
end;
/

Notice that I keep the same number of input rows here, whereas in my previous tests I kept the same number of output rows. My “tokenize” solution scales OK, but the PL/SQL function is much faster and scales even better.

In this case a combined SQL + PL/SQL solution beats the best pure SQL solution.

Splitting Strings: Proof!

In my previous post I used XMLTABLE and ora:tokenize to split a comma delimited string. Now I’ll apply that technique to multiple rows, and show that it’s faster than other methods.

Test data

In my tests, I configure the length of the substring, the number of substrings per row and the total number of rows I should get as output. Each input string is unique because it starts with the ID of the row: this way I avoid any caching that might reduce the number of function calls.

drop table t purge;

create table t cache as
with parms as (
  select 9 str_len, 5 num_subs, 100000 num_rows from dual
)
, str_row as (
  select listagg(n,',') within group(order by n) str
  from (
    select lpad(level+1,str_len,'0') n from parms
    connect by level <= num_subs-1
  )
)
select level id,
  lpad(level,str_len,'0') ||','||str str
from parms, str_row
connect by level <= num_rows/num_subs;

select * from t where id <= 11;
ID  STR
 1  000000001,000000002,000000003,000000004,000000005
 2  000000002,000000002,000000003,000000004,000000005
 3  000000003,000000002,000000003,000000004,000000005
 4  000000004,000000002,000000003,000000004,000000005
 5  000000005,000000002,000000003,000000004,000000005
 6  000000006,000000002,000000003,000000004,000000005
 7  000000007,000000002,000000003,000000004,000000005
 8  000000008,000000002,000000003,000000004,000000005
 9  000000009,000000002,000000003,000000004,000000005
10  000000010,000000002,000000003,000000004,000000005
11  000000011,000000002,000000003,000000004,000000005

Notice the CACHE keyword when I create the table. Before my tests, I access the entire table to make sure it is all in the buffer cache.
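
Any full pass over the table will do for that; something as simple as:

select count(*) from t;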

The “substr+instr” technique

This is the technique from my “New, Improved IN Lists” post. All I need to do is apply it to multiple rows.

One way to do that is to use the 12c LATERAL() clause. If you are not yet in 12c, try
TABLE(CAST(MULTISET(...) AS SYS.ODCIVARCHAR2LIST))

select a.id, b.subs from t a,
lateral(
  select substr(
    str,
    pos + 1,
    lead(pos,1,4000) over(order by pos) - pos - 1
  ) subs
  from (
    select instr(str, ',', 1, level) pos
    from dual
    connect by
      level <= length(str) - nvl(length(replace(str, ',', '')), 0) + 1
  )
) b;

The “tokenize” technique

This one is easy to adapt to multiple rows:

select id, subs from t, xmltable(
  'if (contains($X,",")) then ora:tokenize($X,"\,") else $X'
  passing str as X
  columns subs varchar2(4000) path '.'
);

[Update 2016-08-02: in a comment, Todd Hershiser points out that the second parameter in ora:tokenize is a regex expression. In order to use a regex “metacharacter” like “|” as a delimiter, I need to escape it with a backslash. I decided to put the backslash in everywhere since it doesn’t do any harm.

On the other hand, if the delimiter is ‘&’ then this solution cannot be used.]

The “regexp_substr” technique

This technique is fairly popular, no doubt because it is concise. For multiple rows, I use the “CONNECT BY ... PRIOR SYS_GUID()” technique that I explained in Chunking tables 7: prior sys_guid().

select id, regexp_substr (str, '[^,]+', 1, level) subs
from t
connect by level <= length (regexp_replace (str, '[^,]+')) + 1
and id = prior id
and prior sys_guid() is not null;

Test Results

In my tests, the “substr+instr” solution is slightly better than “tokenize” when there are fewer than 8 substrings per string. As the number of substrings increases, the “tokenize” solution scales much better. As for the “regexp_substr” solution, it is on average 20 times slower than “tokenize”.

Splitting Strings: Surprise!

In my post New, Improved IN Lists!, I split one string into multiple rows. Now I want to split multiple input strings – but first, I’ve rediscovered an even faster technique!

In this post I’ll introduce the technique, and in the next post I’ll do some testing and comparing.

XMLTABLE with XMLTYPE

  • (Warning: the XML services in the Oracle database have evolved over the last several versions. Some of the solutions here may not work in older versions. Also, prior to version 12c a DBA could decide not to install “XML DB” at all! I have only tested in version 12.1.0.2.)

Quite a few folks have used the XMLTABLE function to solve this problem. One way is to change the string into an XML document: there is one root node that contains one child node per substring.

var txt varchar2(20);
exec :txt := 'A,BB,CCC,DDDD,EEEEE';
select '<r><c>'||replace(:txt, ',', '</c>,<c>')||'</c></r>' txt
from dual;
<r><c>A</c>,<c>BB</c>,<c>CCC</c>,<c>DDDD</c>,<c>EEEEE</c></r>

Using the XPath expression '/r/c/text()', XMLTABLE will go through the child nodes and produce one row per substring.

select subs from xmltable(
  '/r/c/text()'
  passing xmltype('<r><c>'||replace(:txt, ',', '</c>,<c>')||'</c></r>')
  columns subs varchar2(4000) path '.'
);
SUBS
A
BB
CCC
DDDD
EEEEE

The main drawback of this solution, aside from performance, is that the input string cannot be a full 4000 bytes long: the added tags make the intermediate string longer, and it still has to fit in a VARCHAR2.

[Update 2016-08-01: Marc Bleron rightly commented that I could use a CLOB as an input string and have way more than 4000 bytes. The limitation I mention is only true when the input string is a VARCHAR2.]

XMLTABLE with an XQuery sequence

If I put double quotes (or single quotes) around all the substrings, then the result is an XQuery sequence. XMLTABLE will simply output that sequence, one row per item.

select '"'||replace(:txt, ',', '","')||'"' str
from dual;

"A","BB","CCC","DDDD","EEEEE"

with data as (
  select '"'||replace(:txt, ',', '","')||'"' str
  from dual
)
select xmlcast(column_value as varchar2(4000)) subs
from data, xmltable(str);
SUBS
A
BB
CCC
DDDD
EEEEE

Again, this solution breaks if the input string is too long.

Tokenize!

Why doesn’t Oracle just provide a function to split these confounded strings? It does!

Oracle XQuery function ora:tokenize lets you use a regular expression to split the input string … into a sequence of strings.

select subs from xmltable(
  'ora:tokenize($X, "\,")'
  passing :txt as X
  columns subs varchar2(4000) path '.'
);

How simple is that? Well, not as simple as I would like. The nasty Oracle developers have decided that this function should raise an exception if the delimiter is not present in the input string. In other words, ‘A,BB’ is OK but just ‘A’ will produce “ORA-19176: FORX0003: regular expression matches zero-length string”.
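
To see the error for yourself, pass in a string that contains no comma:

select subs from xmltable(
  'ora:tokenize($X, "\,")'
  passing 'A' as X
  columns subs varchar2(4000) path '.'
);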

Marc Bleron, who has published the ora:tokenize solution, worked around this problem by concatenating an extra comma to the input string. I worked out this alternative that allows for a full 4000 byte VARCHAR2 input string:

select subs from xmltable(
    'if (contains($X,",")) then ora:tokenize($X,"\,") else $X'
  passing :txt as X
  columns subs varchar2(4000) path '.'
);

[Update 2016-08-02: in a comment on the next post, Todd Hershiser points out that the second parameter in ora:tokenize is a regex expression. In order to use a regex “metacharacter” like “|” as a delimiter, I need to escape it with a backslash. I decided to put the backslash in everywhere since it doesn’t do any harm.]

[Update 2016-08-05: There is a problem with this solution if the ampersand & is involved. You have to convert it to &amp; in the input string and the delimiter.]

[Update 2022-11-14: mathguy reminded me in a comment that there are some possible delimiters that have a special meaning if preceded by a backslash. Indeed, if you use any of the following with a backslash the query will fail: double quote, single quote, 1 through 9, A, W, Z, d, s, z]

What’s Next?

In the next post I’ll compare the performance of these solutions with the substr+instr approach, and with a popular regexp_substr approach.