Splitting strings before they’re CLOBs!

After I tokenized two CLOBs in response to a forum question, Andy Sayer mentioned that they were probably files to begin with. So I smacked my forehead and reached for external tables. Then the light bulb went on: the comma needed to be a record delimiter, not a field separator. Easy.

create table item_ids_EXT(  
  "RN" NUMBER,  
  "ITEM_ID" VARCHAR2(16)  
)  
ORGANIZATION EXTERNAL(  
  TYPE ORACLE_LOADER  
  DEFAULT DIRECTORY EXT_DATA_DIR  
  ACCESS PARAMETERS(  
    records delimited by ','   
    BADFILE EXT_DATA_DIR:'i_item_id.txt.bad'  
    LOGFILE EXT_DATA_DIR:'i_item_id.txt.log'  
    FIELDS TERMINATED BY "§" NOTRIM  
    MISSING FIELD VALUES ARE NULL  
    REJECT ROWS WITH ALL NULL FIELDS  
    (  
      "RN" recnum,  
      "ITEM_ID" char(255)  
    )  
  )  
  location ('i_item_id.txt')  
)  
REJECT LIMIT 10;

create table item_keys_EXT(  
  "RN" NUMBER,  
  "ITEM_KEY" VARCHAR2(255)  
)  
ORGANIZATION EXTERNAL(  
  TYPE ORACLE_LOADER  
  DEFAULT DIRECTORY EXT_DATA_DIR  
  ACCESS PARAMETERS(  
    records delimited by ','   
    BADFILE EXT_DATA_DIR:'i_item_key.txt.bad'  
    LOGFILE EXT_DATA_DIR:'i_item_key.txt.log'  
    FIELDS TERMINATED BY "§" NOTRIM  
    MISSING FIELD VALUES ARE NULL  
    REJECT ROWS WITH ALL NULL FIELDS  
    (  
      "RN" recnum,  
      "ITEM_KEY" char(255)  
    )  
  )  
  location ('i_item_key.txt')  
)  
REJECT LIMIT 10;  

set timing on

create table u as  
select item_id, item_key  
from item_ids_ext  
join item_keys_ext using(rn);

Table U created.

Elapsed: 00:00:00.787

select floor(2 * 60 * 60 / 0.787) times_faster from dual;

TIMES_FASTER
------------
        9148

Splitting strings when they are CLOBs

Like many of you, I have enjoyed using SQL to make data processing go hundreds of times faster. My new record is: 8000 times faster :-)

In a recent ODC thread, the question was how to split out comma separated values from two CLOBs, one containing “item ids” and the other “item keys”. The ids and the keys had to be matched by position: the first id with the first key, the second id with the second key, and so on. Then each id/key pair was to be inserted as a row into a target table.

With 50,000 ids of 7 characters and 50,000 keys of 172 characters, the current processing takes 2 hours.

Test data

First I’ll create a table with the two CLOBs.

--drop table t purge;
create table t as
select 
SUBSTR(
  xmlcast(
    xmlagg(
      xmlelement(E, ',' || to_char(level,'fm0000009'))
      order by level
    ) AS CLOB
  ), 2
) i_item_id,
SUBSTR(
  xmlcast(
    xmlagg(
      xmlelement(E, ',' || rpad(to_char(level,'fm0000009'),172,'*'))
      order by level
    ) AS CLOB
  ), 2
) i_item_key
from dual
connect by level <= 50000;

Splitting CLOB strings

Instead of seeking a “perfect” custom-made solution, I just put together two functions I have already written about:

  1. The PIPE_CLOB function: a pipelined table function that reads a CLOB and outputs up to 4000 bytes at a time, respecting a delimiter (a simplified sketch follows this list).  
  2. The STRING_TOKENIZE function: a pipelined table function that splits a VARCHAR2 string according to a delimiter.
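
The real PIPE_CLOB is in the post linked above; purely so the examples here can be tried without it, below is a much-simplified sketch (not the published code) with the same name and calling signature: CLOB, maximum piece size, delimiter. It is a stand-in built on stated assumptions: single-byte characters, at least one delimiter in every piece-sized window, and a default delimiter of ',' (the published function's defaults and edge-case handling may differ).

create or replace function pipe_clob (
  p_clob    in clob,
  p_max_len in integer default 4000,
  p_delim   in varchar2 default ','   -- default assumed here; the published function may differ
) return sys.odcivarchar2list pipelined
as
  l_pos   integer := 1;                        -- current read offset in the CLOB
  l_len   integer := dbms_lob.getlength(p_clob);
  l_piece varchar2(32767);
  l_cut   integer;
begin
  while l_pos <= l_len loop
    l_piece := dbms_lob.substr(p_clob, p_max_len, l_pos);
    if l_pos + p_max_len > l_len then
      pipe row (l_piece);                      -- last piece: nothing follows, keep it whole
      exit;
    end if;
    -- assumes at least one delimiter occurs within each p_max_len-character window
    l_cut := instr(l_piece, p_delim, -1);      -- last delimiter inside the piece
    pipe row (substr(l_piece, 1, l_cut - 1));  -- emit the piece without its trailing delimiter
    l_pos := l_pos + l_cut;                    -- resume just after that delimiter
  end loop;
  return;
end pipe_clob;
/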

All I have to do is feed the output from the first function into the second, and voilà!

select b.column_value item_id
from t,
table(pipe_clob( i_item_id, 4000, ',')) a,
table(string_tokenize(a.column_value,',')) b;

Combining the two CLOBs

Remember, the requirement was to combine the values from both CLOBs by position, in order to insert them into a target table. Since the functions return the values in order, I just use ROWNUM to number the values and use that number to join them.

--drop table u purge;
create table u as
with item_ids as (
  select rownum rn,
    b.column_value item_id
    from t,
  table(pipe_clob( i_item_id, 4000, ',')) a,
  table(string_tokenize(a.column_value,',')) b
)
, item_keys as (
  select rownum rn,
    b.column_value item_key
    from t,
  table(pipe_clob( i_item_key, 4000, ',')) a,
  table(string_tokenize(a.column_value,',')) b
)
select item_id, item_key 
from item_ids join item_keys using(rn);

Table U created.

Elapsed: 00:00:00.879

That’s right, processing time went from two hours to less than one second.

Now, I suspect that the questioner’s code is doing more than just a simple insert, but until he or she provides more details about the requirement, I get to maintain my claim:

8000 times faster!

(which I hope applies only to my code, not to the speed at which time flies…)

Extract from CLOB with JSON objects

On AskTOM, Kim Berg Hansen recently used JSON_OBJECT to parse CLOB data in “name=value” format. I added a variant based on my work with CSV data.

Kim decided to use objects instead of arrays for an excellent and important reason:

  • a JSON object is an unordered collection of name/value pairs.
  • a JSON array is an ordered list of values.

CSV data is an ordered list of values, going from left to right, so a JSON array is the obvious choice. The AskTOM question concerned name/value pairs that were not necessarily in the same order and not necessarily all there! A JSON object was the natural choice.
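
To make the difference concrete, here is a tiny contrast of my own (using the JSON generation functions available since 12.2, with made-up EMP-style values): object members can be produced in any order and are looked up by name, while array elements are addressed purely by position.

select json_object('empno' value 7369, 'ename' value 'SMITH') obj,  -- found by name, order irrelevant
       json_array(7369, 'SMITH') arr                                -- found by position only
from dual;

OBJ                              ARR
-------------------------------- --------------
{"empno":7369,"ename":"SMITH"}   [7369,"SMITH"]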

I’m really liking JSON more and more: not only is it simple, but it seems pretty easy to figure out how best to use it for different problems.

Instead of copying everything here, I’m going to be lazy for once and invite you to read my contribution here:

More on JSON_TABLE

You will see some of the same ideas (roughly sketched after this list):

  • Use PIPE_CLOB to cut the CLOB into VARCHAR sized bites;
  • Use JSON_ARRAY on the whole bite to escape characters if needed;
  • Use REPLACE to form an overall array, but this time of objects;
  • Then use JSON_TABLE to generate one row per object and one column per value.
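
The linked contribution has the full details. Purely as a hedged sketch (the actual AskTOM data is not reproduced here), assume newline-delimited records of semicolon-separated name=value pairs, with names possibly missing or out of order; the same JSON_ARRAY-plus-REPLACE trick then builds an array of objects instead of an array of arrays. It breaks if a value itself contains '=' or ';'.

with input(str) as (
  -- hypothetical sample: names vary in order and some are absent
  select 'empno=7369;ename=SMITH;job=CLERK' || chr(10) ||
         'ename=ALLEN;empno=7499'           || chr(10) ||
         'job=MANAGER;empno=7566;ename=JONES'
  from dual
)
, json_data as (
  select '[{"' ||
         replace(
           replace(
             replace(
               substr(json_array(str), 3, length(json_array(str)) - 4),  -- strip the [" and "] added by JSON_ARRAY
               '=', '":"'),        -- name=value       -> "name":"value"
             ';', '","'),          -- pair separator   -> between members
           '\n', '"},{"')          -- record delimiter -> between objects (JSON_ARRAY escaped it as \n)
         || '"}]' jstr
  from input
)
select jt.*
from json_data j, json_table(
  j.jstr, '$[*]'
  columns empno varchar2(10) path '$.empno'
        , ename varchar2(30) path '$.ename'
        , job   varchar2(30) path '$.job'
) jt;

EMPNO      ENAME  JOB
---------- ------ --------
7369       SMITH  CLERK
7499       ALLEN
7566       JONES  MANAGER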

Hope this helps…

Extract CSV from CLOB with JSON arrays

Marc Bleron blogged about CSV CLOBs and JSON_TABLE two years ago. Here’s my contribution to improve on a great idea.

“Simple” CSV

The Oracle Database does a fine job of parsing CSV data in flat files, using the External Tables facility. Unfortunately, this service is not available for CSV data in VARCHARs or CLOBs. Marc showed that JSON_TABLE (and XMLTABLE) can parse “simple” CSV if it is reformatted first. What is “simple”?

CSV data consists of records and fields within records.

  • Records are delimited by NEWLINE (or some other string).
  • Fields are terminated by commas (or some other string).
  • If necessary, some or all fields can be enclosed by double quotes " (or some other string).

When Marc says “simple”, he means that fields are never enclosed. This is important, because enclosed fields may contain the double quote (provided it appears twice in a row) and / or the comma. With “simple” CSV, we know that every comma is a true field terminator, and we don’t have to replace doubled quotes ("") with single ones (").

“Simple” also means that there is no trimming of whitespace: you get everything between the commas.

Finally, Marc assumes there is no terminator after the last field of the record, even though Oracle allows it.

So, “simple” CSV has delimited records with terminated fields that are never enclosed. There is no trimming of whitespace and the last field in the record is not terminated.
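
For illustration (these two records are invented, not from Marc’s post), the first line below is not “simple”: its second field is enclosed, and contains both a comma and a doubled double quote. The second line is “simple”: every comma really is a field terminator and nothing is enclosed.

7369,"SMITH, James ""Jim""",CLERK
7369,SMITH,CLERK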

My contribution

  • First of all, I break the CLOB into VARCHAR2 bites using the pipelined table function PIPE_CLOB (as explained in my previous post).
  • Then I remove any field terminator that immediately precedes a record delimiter.
  • Then I use JSON_ARRAY over the entire VARCHAR2 in case some characters need to be escaped.
  • Then I do several REPLACES such that:
    • each record becomes a JSON array of string values, and
    • those arrays are included in one overall array.
  • Finally, I use JSON_TABLE to break the overall array into rows and the inner arrays into columns.

Note that everything before the COLUMNS clause in JSON_TABLE is generic, because the inner arrays can contain any number of elements.

To demonstrate, here is a CLOB containing data from the EMP table, with a trailing comma added:

7369,SMITH,CLERK,7902,1980-12-17T00:00:00,800,,20,
7499,ALLEN,SALESMAN,7698,1981-02-20T00:00:00,1600,300,30,
7521,WARD,SALESMAN,7698,1981-02-22T00:00:00,1250,500,30,
7566,JONES,MANAGER,7839,1981-04-02T00:00:00,2975,,20,
7654,MARTIN,SALESMAN,7698,1981-09-28T00:00:00,1250,1400,30,
7698,BLAKE,MANAGER,7839,1981-05-01T00:00:00,2850,,30,
7782,CLARK,MANAGER,7839,1981-06-09T00:00:00,2450,,10,
7839,KING,PRESIDENT,,1981-11-17T00:00:00,5000,,10,
7844,TURNER,SALESMAN,7698,1981-09-08T00:00:00,1500,0,30,
7876,ADAMS,CLERK,7788,1987-05-23T00:00:00,1100,,20,
7900,JAMES,CLERK,7698,1981-12-03T00:00:00,950,,30,
7902,FORD,ANALYST,7566,1981-12-03T00:00:00,3000,,20,
7934,MILLER,CLERK,7782,1982-01-23T00:00:00,1300,,10,

And the code:

with last_term_removed as (
  select replace(column_value, ','||chr(10), chr(10)) str  -- drop a field terminator that immediately precedes a record delimiter
  from table(pipe_clob((select c from t), 3500))
)
, json_data as (
  select '[' ||
    replace (
      replace(json_array(str), ',', '","'),  -- field terminators become separators between array elements
      '\n',                                  -- the record delimiter, escaped by JSON_ARRAY as \n ...
      '"],["'                                -- ... becomes the boundary between inner arrays
    )
    || ']' jstr
  from last_term_removed
)
select sql_data.*
from json_data j, json_table(
   j.jstr, '$[*]'
   columns empno    number        path '$[0]'
         , ename    varchar2(128) path '$[1]'
         , job      varchar2(128) path '$[2]'
         , mgr      number        path '$[3]'
         , hiredate date          path '$[4]'
         , sal      number        path '$[5]'
         , comm     number        path '$[6]'
         , deptno   number        path '$[7]'
) sql_data;
EMPNO ENAME  JOB        MGR HIREDATE    SAL COMM DEPTNO
----- ------ --------- ---- ---------- ---- ---- ------
 7369 SMITH  CLERK     7902 1980-12-17  800          20
 7499 ALLEN  SALESMAN  7698 1981-02-20 1600  300     30
 7521 WARD   SALESMAN  7698 1981-02-22 1250  500     30
 7566 JONES  MANAGER   7839 1981-04-02 2975          20
 7654 MARTIN SALESMAN  7698 1981-09-28 1250 1400     30
 7698 BLAKE  MANAGER   7839 1981-05-01 2850          30
 7782 CLARK  MANAGER   7839 1981-06-09 2450          10
 7839 KING   PRESIDENT      1981-11-17 5000          10
 7844 TURNER SALESMAN  7698 1981-09-08 1500    0     30
 7876 ADAMS  CLERK     7788 1987-05-23 1100          20
 7900 JAMES  CLERK     7698 1981-12-03  950          30
 7902 FORD   ANALYST   7566 1981-12-03 3000          20
 7934 MILLER CLERK     7782 1982-01-23 1300          10


Scalability: yes!

When Marc tested reading CLOBs directly, performance went bad as the CLOB increased in size:

   Rows  Seconds
-------  -------
  91336      2.2
 182672      4.3
 365344      9.4
 730688       22
1461376      840


In my tests with very similar data, the number of rows per second remains about the same:

  LINES   SECS LINES_PER_SEC AVG_LINES_PER_SEC PCT_DIFF
------- ------ ------------- ----------------- --------
  91336  1.374         66475             68517       -3
 182672    2.6         70258             68517      2.5
 365344   5.35         68289             68517      -.3
 730688     10         73069             68517      6.6
1461376     22         66426             68517     -3.1


Split a string again and again

I recently needed to gather multiple columns from multiple rows into one string, then split them out again. JSON_TABLE did both splits at the same time!

The situation

I want to find the shortest “foreign key” path between two tables in the OBE schema: OEHR_ORDER_ITEMS and OEHR_DEPARTMENTS. I also want the intermediate tables and the constraints used to connect the tables. As a starting point for today, I’ll create a little table with the data for the tables in the OBE schema that start with ‘OEHR’:

drop table t purge;
create table t as
with search_for (owner, table_name) as (
  select 'OBE', 'OEHR_DEPARTMENTS' from dual union all
  select 'OBE', 'OEHR_ORDER_ITEMS' from dual
)
, pur as (
  select OWNER, CONSTRAINT_NAME, TABLE_NAME,
    R_OWNER, R_CONSTRAINT_NAME
  from dba_constraints
  where constraint_type in('P','U','R')
  and owner = 'OBE'
  and table_name like 'OEHR%'
)
, relations as (
  select a.table_name, a.constraint_name,
    b.table_name r_table_name, a.r_constraint_name
  from pur a join pur b
  on (a.R_OWNER, a.R_CONSTRAINT_NAME) = ((b.OWNER, b.CONSTRAINT_NAME))
  and (a.OWNER, a.table_name) != ((b.OWNER, b.table_name))
)
select * from relations;

Now I’ll use a hierarchical query to find paths from one table to the other:

select rownum rn, level lvl, TABLE_NAME, CONSTRAINT_NAME, R_TABLE_NAME, R_CONSTRAINT_NAME
from t a
start with table_name = 'OEHR_ORDER_ITEMS'
connect by nocycle table_name = prior r_table_name
and table_name != 'OEHR_DEPARTMENTS';
RN LVL TABLE_NAME       CONSTRAINT_NAME                R_TABLE_NAME             R_CONSTRAINT_NAME
-- --- ---------------- ------------------------------ ------------------------ --------------------
 1   1 OEHR_ORDER_ITEMS OEHR_ORDER_ITEMS_ORDER_ID_FK   OEHR_ORDERS              OEHR_ORDER_PK
 2   2 OEHR_ORDERS      OEHR_ORDERS_CUSTOMER_ID_FK     OEHR_CUSTOMERS           OEHR_CUSTOMERS_PK
 3   3 OEHR_CUSTOMERS   OEHR_CUST_ACCT_MGR_FK          OEHR_EMPLOYEES           OEHR_EMP_EMP_ID_PK
 4   4 OEHR_EMPLOYEES   OEHR_EMP_DEPT_FK               OEHR_DEPARTMENTS         OEHR_DEPT_ID_PK
 5   4 OEHR_EMPLOYEES   OEHR_EMP_JOB_FK                OEHR_JOBS                OEHR_JOB_ID_PK
 6   2 OEHR_ORDERS      OEHR_ORDERS_SALES_REP_FK       OEHR_EMPLOYEES           OEHR_EMP_EMP_ID_PK
 7   3 OEHR_EMPLOYEES   OEHR_EMP_DEPT_FK               OEHR_DEPARTMENTS         OEHR_DEPT_ID_PK
 8   3 OEHR_EMPLOYEES   OEHR_EMP_JOB_FK                OEHR_JOBS                OEHR_JOB_ID_PK
 9   1 OEHR_ORDER_ITEMS OEHR_ORDER_ITEMS_PRODUCT_ID_FK OEHR_PRODUCT_INFORMATION OEHR_PRODUCT_INFO_PK


Just looking at the data myself, I can see that the shortest path will include rows 1, 6 and 7. Rows 1 through 4 will work too, but the path is longer.

Getting the data for the whole path

The problem is, how can I identify those three rows I want and just query them? The only way I see is to use SYS_CONNECT_BY_PATH (or the equivalent with a CTE) to string together all the data from all the rows I want – which means I have to split that data out again later.

To simplify what follows, I’m going to manually delete the rows that don’t matter:

delete from t where table_name not in ('OEHR_ORDER_ITEMS','OEHR_ORDERS','OEHR_EMPLOYEES');
delete from t where r_table_name not in ('OEHR_DEPARTMENTS','OEHR_ORDERS','OEHR_EMPLOYEES');
commit;

The rows that are left contain the data I want:

TABLE_NAME       CONSTRAINT_NAME              R_TABLE_NAME     R_CONSTRAINT_NAME
---------------- ---------------------------- ---------------- ------------------
OEHR_ORDER_ITEMS OEHR_ORDER_ITEMS_ORDER_ID_FK OEHR_ORDERS      OEHR_ORDER_PK
OEHR_ORDERS      OEHR_ORDERS_SALES_REP_FK     OEHR_EMPLOYEES   OEHR_EMP_EMP_ID_PK
OEHR_EMPLOYEES   OEHR_EMP_DEPT_FK             OEHR_DEPARTMENTS OEHR_DEPT_ID_PK


Now I’ll string that data together:

select sys_connect_by_path(
  table_name  ||'/'||  constraint_name||'/'||
  r_table_name||'/'||r_constraint_name
  ,'<'
) path
from t a
where r_table_name = 'OEHR_DEPARTMENTS'
start with table_name = 'OEHR_ORDER_ITEMS'
connect by table_name = prior r_table_name;

<OEHR_ORDER_ITEMS/OEHR_ORDER_ITEMS_ORDER_ID_FK/OEHR_ORDERS/OEHR_ORDER_PK
<OEHR_ORDERS/OEHR_ORDERS_SALES_REP_FK/OEHR_EMPLOYEES/OEHR_EMP_EMP_ID_PK
<OEHR_EMPLOYEES/OEHR_EMP_DEPT_FK/OEHR_DEPARTMENTS/OEHR_DEPT_ID_PK

I split the data into three lines for legibility. Note that the path separator passed as the second parameter of SYS_CONNECT_BY_PATH must not appear in the first parameter’s value, otherwise ORA-30004 is raised; that is why I concatenate the column values with ‘/’ but use ‘<’ as the separator.

Splitting out the data

Now we’re finally getting to the point. I can use JSON_ARRAY to string together the data for one row. Once I have all the data for all the rows, I can turn that into an array of arrays. Finally, JSON_TABLE will split everything out for me. Let’s start with JSON_ARRAY:

select sys_connect_by_path(
  json_array(table_name, constraint_name, r_table_name, r_constraint_name)
  , '
'
) path
from t a
where r_table_name = 'OEHR_DEPARTMENTS'
start with table_name = 'OEHR_ORDER_ITEMS'
connect by table_name = prior r_table_name;

["OEHR_ORDER_ITEMS","OEHR_ORDER_ITEMS_ORDER_ID_FK","OEHR_ORDERS","OEHR_ORDER_PK"]
["OEHR_ORDERS","OEHR_ORDERS_SALES_REP_FK","OEHR_EMPLOYEES","OEHR_EMP_EMP_ID_PK"]
["OEHR_EMPLOYEES","OEHR_EMP_DEPT_FK","OEHR_DEPARTMENTS","OEHR_DEPT_ID_PK"]

Now I’ll turn that data into an array of arrays:

with data as (
  select sys_connect_by_path(
    json_array(table_name, constraint_name, r_table_name, r_constraint_name)
    , '/'
  ) path
  from t a
  where r_table_name = 'OEHR_DEPARTMENTS'
  start with table_name = 'OEHR_ORDER_ITEMS'
  connect by table_name = prior r_table_name
)
select '['||substr(replace(path,'/',','),2) || ']' path
from data;
[
  [
    "OEHR_ORDER_ITEMS",
    "OEHR_ORDER_ITEMS_ORDER_ID_FK",
    "OEHR_ORDERS",
    "OEHR_ORDER_PK"
  ],
  [
    "OEHR_ORDERS",
    "OEHR_ORDERS_SALES_REP_FK",
    "OEHR_EMPLOYEES",
    "OEHR_EMP_EMP_ID_PK"
  ],
  [
    "OEHR_EMPLOYEES",
    "OEHR_EMP_DEPT_FK",
    "OEHR_DEPARTMENTS",
    "OEHR_DEPT_ID_PK"
  ]
]

Now JSON_TABLE will break down the top array into rows, just by putting ‘$[*]’ in the second parameter. The COLUMNS will do the rest, splitting the data in the inner arrays into different columns.

with data as (
  select sys_connect_by_path(
    json_array(table_name, constraint_name, r_table_name, r_constraint_name)
    , '/'
  ) path
  from t a
  where r_table_name = 'OEHR_DEPARTMENTS'
  start with table_name = 'OEHR_ORDER_ITEMS'
  connect by table_name = prior r_table_name
)
select a.* from data, json_table(
  '['||substr(replace(path,'/',','),2) || ']',
  '$[*]' columns(
    rn for ordinality,
    table_name varchar2(128) path '$[0]',
    constraint_name varchar2(128) path '$[1]',
    r_table_name varchar2(128) path '$[2]',
    r_constraint_name varchar2(128) path '$[3]'
  )
) a;
RN TABLE_NAME       CONSTRAINT_NAME              R_TABLE_NAME     R_CONSTRAINT_NAME
-- ---------------- ---------------------------- ---------------- ------------------
 1 OEHR_ORDER_ITEMS OEHR_ORDER_ITEMS_ORDER_ID_FK OEHR_ORDERS      OEHR_ORDER_PK
 2 OEHR_ORDERS      OEHR_ORDERS_SALES_REP_FK     OEHR_EMPLOYEES   OEHR_EMP_EMP_ID_PK
 3 OEHR_EMPLOYEES   OEHR_EMP_DEPT_FK             OEHR_DEPARTMENTS OEHR_DEPT_ID_PK


Whew!

P.S. If you think a CTE solution is cleaner:

with data( TABLE_NAME, R_TABLE_NAME, path ) as (
  select n.TABLE_NAME, n.R_TABLE_NAME,
    json_array(n.TABLE_NAME, n.CONSTRAINT_NAME, n.R_TABLE_NAME, n.R_CONSTRAINT_NAME)
  from t n
  where table_name = 'OEHR_ORDER_ITEMS'
  union all
  select n.TABLE_NAME, n.R_TABLE_NAME,
    o.path || ',' ||
    json_array(n.TABLE_NAME, n.CONSTRAINT_NAME, n.R_TABLE_NAME, n.R_CONSTRAINT_NAME)
  from t n
  join data o on (n.table_name = o.r_table_name)
    and n.table_name != 'OEHR_DEPARTMENTS'
)
select b.* from data a, json_table(
  '['|| path || ']',
  '$[*]' columns(
    rn for ordinality,
    table_name varchar2(128) path '$[0]',
    constraint_name varchar2(128) path '$[1]',
    r_table_name varchar2(128) path '$[2]',
    r_constraint_name varchar2(128) path '$[3]'
  )
) b
where a.r_table_name = 'OEHR_DEPARTMENTS';

Splitting Strings: a New Champion!

My last post on splitting strings conceded that “a combined SQL + PL/SQL solution beats the best pure SQL solution.” Wrong! I didn’t try JSON_TABLE.

Pretend it’s an array

It’s not hard to reformat a string like

'a,b,c,d,e,"'

to be a JSON array:

select replace(json_array('a,b,c,d,e,"'), ',', '","') jstr
from dual;

JSTR
--------------------------
["a","b","c","d","e","\""]

Notice that the double quote in the data itself is properly escaped by the call to JSON_ARRAY. Then I turn the string into a series of array elements by putting double quotes around each comma. Once all that is done, JSON_TABLE will split it like this:

select subs from json_table(
  replace(json_array('a,b,c,d,e,"'), ',', '","'),
  '$[*]' columns (
    subs varchar2(4000) path '$'
  )
);

SUBS
----
a
b
c
d
e
"

So the escaped data is automatically unescaped. Now for speed, here is my test harness:

set serveroutput on
declare
  l_num number;
  l_timestamp timestamp;
  l_plsql_secs number;
  l_JSON_secs number;
  l_num_substrings number := 10;
  procedure do(p_sql in varchar2) is
  begin
    dbms_output.put_line(p_sql);
    execute immediate p_sql;
  end do;
begin
  select count(*) into l_num from user_tables where table_name = 'T';
  if l_num > 0 then
    do('drop table t purge');
  end if;
  do('create table t(id number, str varchar2(4000)) cache');
  insert into t
  select level, to_char(level,'fm000000000')||',000000002,000000003,000000004,000000005,000000006,000000007,000000008,000000009,000000010'
  from dual
  connect by level <= 10000;
  commit;
  dbms_output.put_line('Substrings' || chr(9) || 'JSON' || chr(9) || 'PL/SQL');
  for i in 1..10 loop
    select count(*) into l_num from t;
     
    l_timestamp := localtimestamp;
    select count(column_value) into l_num from (
      select id, column_value from t, table(string_tokenize(str, ','))
    );
    l_plsql_secs := extract(second from localtimestamp - l_timestamp);
     
    l_timestamp := localtimestamp;
    select count(subs) into l_num from (
      select id, subs from t, JSON_table(
      '["' || replace(str, ',' , '","') || '"]', '$[*]' columns(
      subs varchar2(99) path '$'
      )
    )
    );
    l_JSON_secs := extract(second from localtimestamp - l_timestamp);
    dbms_output.put_line(l_num_substrings || chr(9) || l_JSON_secs || chr(9) || l_plsql_secs);
    update t set str =
    str || ',000000001,000000002,000000003,000000004,000000005,000000006,000000007,000000008,000000009,000000010';
    l_num_substrings := l_num_substrings + 10;
    commit;
  end loop;
end;
/

Notice that I keep the same number of input rows here and increase the number of substrings per row.

PL/SQL does great, but JSON_TABLE appears to beat it.

Splitting Strings: PL/SQL

Marc Bleron and Todd Hershiser gave me some very valuable feedback through their comments on my recent “Splitting Strings” posts. The big news is: PL/SQL beats SQL!

fn:tokenize

XQuery provides some built-in functions prefixed by “fn”. fn:tokenize is equivalent to ora:tokenize, except that it doesn’t break when the delimiter is absent from the input string: it just returns the string. Marc says fn:tokenize is supported, and it does work in 12.1.0.2, but the official documentation says “not supported”. I have asked Marc for more information.

This is not a big deal. With fn:tokenize, there would simply be a bit less typing.

Escaping the delimiter

Todd pointed out that ora:tokenize splits a string based on a regular expression, and some common delimiters (like ‘|’) have special meaning in regular expressions. As a result, I now recommend simply escaping the delimiter with a backslash ‘\’.
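
For example (a quick sketch of my own: ‘|’ is a regular-expression metacharacter, so it is escaped as ‘\|’; the contains() guard is there because ora:tokenize raises an error when the delimiter does not occur in the string):

select subs
from xmltable(
  'if (contains($X,"|")) then ora:tokenize($X,"\|") else $X'
  passing 'a|b|c' as X
  columns subs varchar2(4000) path '.'
);

SUBS
----
a
b
c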

PL/SQL

Todd provided a string splitting function in PL/SQL and claimed it is clearly faster than ora:tokenize. He is right!

I wrote a function similar to his and compared it to the “tokenize” solution. Here is the function:

create or replace function string_tokenize(
  p_string in varchar2,
  p_delim in varchar2
)
return sys.odcivarchar2list pipelined
as
  i_prev_pos integer := 1;                    -- start of the current token
  i_pos integer;                              -- position of the next delimiter
  i_max_pos integer := length(p_string) + 1;  -- virtual delimiter just past the end
  i_delim_length integer := length(p_delim);
begin
  loop
    i_pos := instr(p_string, p_delim, i_prev_pos);
    if i_pos = 0 then
      i_pos := i_max_pos;                     -- no more delimiters: this is the last token
    end if;
    pipe row(substr(p_string, i_prev_pos, i_pos - i_prev_pos));
    exit when i_pos = i_max_pos;
    i_prev_pos := i_pos + i_delim_length;     -- continue just past the delimiter
  end loop;
  return;
end string_tokenize;
/

By the way, I tested this function with and without the PRAGMA UDF clause introduced in Oracle 12c. I found no difference in performance in this case. Here is my final test harness:

set serveroutput on
declare
  l_num number;
  l_timestamp timestamp;
  l_plsql_secs number;
  l_tokenize_secs number;
  l_num_substrings number := 10;
  procedure do(p_sql in varchar2) is
  begin
    dbms_output.put_line(p_sql);
    execute immediate p_sql;
  end do;
begin
  select count(*) into l_num from user_tables where table_name = 'T';
  if l_num > 0 then
    do('drop table t purge');
  end if;
  do('create table t(id number, str varchar2(4000)) cache');
  insert into t
  select level, to_char(level,'fm000000000')||',000000002,000000003,000000004,000000005,000000006,000000007,000000008,000000009,000000010'
  from dual
  connect by level <= 10000;
  commit;
  dbms_output.put_line('Substrings' || chr(9) || 'tokenize' || chr(9) || 'PL/SQL');
  for i in 1..10 loop
    select count(*) into l_num from t;
    
    l_timestamp := localtimestamp;
    select count(column_value) into l_num from (
      select id, column_value from t, table(string_tokenize(str, ','))
    );
    l_plsql_secs := extract(second from localtimestamp - l_timestamp);
    
    l_timestamp := localtimestamp;
    select count(subs) into l_num from (
      select id, subs from t, xmltable(
        'if (contains($X,",")) then ora:tokenize($X,"\,") else $X' 
        passing str as X columns subs varchar2(4000) path '.')
    );
    l_tokenize_secs := extract(second from localtimestamp - l_timestamp);
    dbms_output.put_line(l_num_substrings || chr(9) || l_tokenize_secs || chr(9) || l_plsql_secs);
    update t set str =
    str || ',000000001,000000002,000000003,000000004,000000005,000000006,000000007,000000008,000000009,000000010';
    l_num_substrings := l_num_substrings + 10;
    commit;
  end loop;
end;
/

Notice that I keep the same number of input rows here, whereas in my previous tests I kept the same number of output rows. My “tokenize” solution scales OK, but the PL/SQL function is much faster and scales even better.

In this case a combined SQL + PL/SQL solution beats the best pure SQL solution.