Splitting Strings: PL/SQL

Marc Bleron and Todd Hershiser gave me some very valuable feedback through their comments on my recent “Splitting Strings” posts. The big news is: PL/SQL beats SQL!


fn:tokenize

XQuery provides some built-in functions prefixed by “fn”. fn:tokenize is equivalent to ora:tokenize except that it doesn’t break when the delimiter is absent from the input string: it just returns the string. Marc says fn:tokenize is supported, and it does work, but the official documentation says “not supported”. I have asked Marc for more information.

This is not a big deal. With fn:tokenize, there would simply be a bit less typing.

Escaping the delimiter

Todd pointed out that tokenize splits a string based on a regular expression, and some common delimiters (like ‘|’) have special meaning in regular expressions. As a result, I now recommend simply escaping the delimiter with a backslash ‘\’.
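This pitfall is not Oracle-specific: any regex engine treats an unescaped ‘|’ as alternation. A quick illustration with Python’s re module (just to show the regex behavior, not Oracle code):

```python
import re

s = "a|b|c"

# Unescaped: '|' is alternation between two empty patterns, which matches
# at every position, so the split shreds the string instead of tokenizing it.
print(re.split(r"|", s))

# Escaped with a backslash: '|' is treated as a literal delimiter.
print(re.split(r"\|", s))  # -> ['a', 'b', 'c']
```

The same escape works in ora:tokenize, which is why `"\,"` appears in the test harness below.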


Todd provided a string splitting function in PL/SQL and claimed it is clearly faster than ora:tokenize. He is right!

I wrote a function similar to his and compared it to the “tokenize” solution. Here is the function:

create or replace function string_tokenize(
  p_string in varchar2,
  p_delim in varchar2
)
return sys.odcivarchar2list pipelined
is
  i_prev_pos integer := 1;
  i_pos integer;
  i_max_pos integer := length(p_string) + 1;
  i_delim_length integer := length(p_delim);
begin
  loop
    i_pos := instr(p_string, p_delim, i_prev_pos);
    if i_pos = 0 then
      i_pos := i_max_pos;
    end if;
    pipe row(substr(p_string, i_prev_pos, i_pos - i_prev_pos));
    exit when i_pos = i_max_pos;
    i_prev_pos := i_pos + i_delim_length;
  end loop;
  return;
end string_tokenize;
/
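The core algorithm is simple: scan for the next delimiter with INSTR, emit the substring before it, then continue past the delimiter. A rough Python transcription of the same loop (illustrative only, not Oracle code):

```python
def string_tokenize(s: str, delim: str):
    """Yield substrings of s split on the literal delimiter: scan with
    find() (Python's analogue of Oracle's INSTR), emit each piece,
    then resume just past the delimiter."""
    prev = 0
    max_pos = len(s)
    while True:
        pos = s.find(delim, prev)
        if pos == -1:          # no more delimiters: emit the tail and stop
            pos = max_pos
        yield s[prev:pos]
        if pos == max_pos:
            break
        prev = pos + len(delim)

print(list(string_tokenize("a,b,,c", ",")))  # -> ['a', 'b', '', 'c']
```

Note that, like the PL/SQL function, this emits an empty token for two adjacent delimiters rather than discarding it.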

By the way, I tested this function with and without the PRAGMA UDF clause introduced in Oracle 12c. I found no difference in performance in this case. Here is my final test harness:

set serveroutput on
declare
  l_num number;
  l_timestamp timestamp;
  l_plsql_secs number;
  l_tokenize_secs number;
  l_num_substrings number := 10;
  procedure do(p_sql in varchar2) is
  begin
    execute immediate p_sql;
  end do;
begin
  select count(*) into l_num from user_tables where table_name = 'T';
  if l_num > 0 then
    do('drop table t purge');
  end if;
  do('create table t(id number, str varchar2(4000)) cache');
  insert into t
  select level, to_char(level,'fm000000000')||',000000002,000000003,000000004,000000005,000000006,000000007,000000008,000000009,000000010'
  from dual
  connect by level <= 10000;
  dbms_output.put_line('Substrings' || chr(9) || 'tokenize' || chr(9) || 'PL/SQL');
  for i in 1..10 loop
    select count(*) into l_num from t;
    l_timestamp := localtimestamp;
    select count(column_value) into l_num from (
      select id, column_value from t, table(string_tokenize(str, ','))
    );
    l_plsql_secs := extract(second from localtimestamp - l_timestamp);
    l_timestamp := localtimestamp;
    select count(subs) into l_num from (
      select id, subs from t, xmltable(
        'if (contains($X,",")) then ora:tokenize($X,"\,") else $X'
        passing str as X columns subs varchar2(4000) path '.')
    );
    l_tokenize_secs := extract(second from localtimestamp - l_timestamp);
    dbms_output.put_line(l_num_substrings || chr(9) || l_tokenize_secs || chr(9) || l_plsql_secs);
    update t set str =
    str || ',000000001,000000002,000000003,000000004,000000005,000000006,000000007,000000008,000000009,000000010';
    l_num_substrings := l_num_substrings + 10;
  end loop;
end;
/

Notice that I keep the same number of input rows here, whereas in my previous tests I kept the same number of output rows. My “tokenize” solution scales OK, but the PL/SQL function is much faster and scales even better.

In this case a combined SQL + PL/SQL solution beats the best pure SQL solution.

7 thoughts on “Splitting Strings: PL/SQL”

  1. Hurray, PL/SQL! :-)

    Great post, thanks for this in-depth analysis, Stew.

    In terms of UDF not making much of a difference, you are only calling it once – for every iteration in the test – within the TABLE operator.

    So I wouldn’t expect you to see much of a difference, compared, say, to invoking a PL/SQL function in the WHERE clause and having it executed thousands of times or more.

    • You’re welcome, Kim!

      I am happy to see that my function is not too different from the one on page 35 of your presentation ;-)

      I didn’t know your presentation was already on slideshare. A good read ahead…

  2. Interesting, as usual.

    Obviously, the regexp stuff used behind the scenes by ora:tokenize does a lot more work than just looking for a simple delimiter.
    It’s still relatively fast compared to other approaches because, internally, the CBO rewrites ora:tokenize to the built-in kernel function SYS_XQTOKENIZE, as part of an optimization process called “XQuery Rewrite”.
    Sometimes, this rewrite is good, sometimes not. It can be disabled via the NO_XML_QUERY_REWRITE hint.
    I’ve run your test case with this additional hint, and that makes tokenize a little faster (at the expense of more PGA used), but it is still about 3x slower than PL/SQL.

  3. Great demonstration.

    And it should be straightforward to extend the PL/SQL code so that it can handle CSV quote delimited fields, too.
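Quote-aware splitting is exactly what a CSV parser does. As a reference for the target behavior (not the PL/SQL extension the commenter has in mind), Python’s standard csv module handles quote-delimited fields out of the box:

```python
import csv

# A field containing the delimiter is protected by double quotes,
# and an embedded quote is doubled, per the usual CSV convention.
row = 'a,"b,with,commas","say ""hi""",c'
print(next(csv.reader([row])))  # -> ['a', 'b,with,commas', 'say "hi"', 'c']
```

The PL/SQL version would need the same state machine: track whether the scan is currently inside quotes, and only treat the delimiter as a split point when it is not.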

  4. Great demonstration, thanks for this in-depth analysis, Stew.

    I have one question here,

    For example, I have a CLOB column which contains key value pair data with pipe delimiter like below.

    seq_id, col1|value1|col2|value2|col3|value3|col4|value4|……|coln|valuen|

    I want to split key value files into another temp table like below.

    ID columns Values
    seq_id col1 value1
    seq_id col2 value2
    seq_id col3 value3
    seq_id col4 value4
    seq_id coln valuen

    Thanks in advance.

    • create table t(c clob);
      insert into t
      select replace(
        listagg('ID'||level || '|' || 'value '||level, ','), ',' , '|')
      from dual
      connect by level <= 9;

      select key, val from (
        select rn, key_value key,
          lead(key_value) over(order by rn) val
        from (
          select rownum rn, b.column_value key_value
          from t,
          table(pipe_clob( c, 4000, '|')) a,
          table(string_tokenize(a.column_value,'|')) b
        )
      )
      where mod(rn,2) = 1;

      KEY VAL
      ID1 value 1
      ID2 value 2
      ID3 value 3
      ID4 value 4
      ID5 value 5
      ID6 value 6
      ID7 value 7
      ID8 value 8
      ID9 value 9

      Regards, Stew
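The lead()/mod(rn,2) pairing in the reply above boils down to: split, then zip even-indexed tokens with odd-indexed ones. A minimal Python sketch of the same idea (illustrative only):

```python
def pair_tokens(s: str, delim: str = "|"):
    """Split on the literal delimiter, drop the empty token left by a
    trailing delimiter, then pair alternating tokens as (key, value)."""
    tokens = [t for t in s.split(delim) if t != ""]
    return list(zip(tokens[0::2], tokens[1::2]))

print(pair_tokens("col1|value1|col2|value2|"))
# -> [('col1', 'value1'), ('col2', 'value2')]
```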
