Splitting Strings: PL/SQL

Marc Bleron and Todd Hershiser gave me some very valuable feedback through their comments on my recent “Splitting Strings” posts. The big news is: PL/SQL beats SQL!

fn:tokenize

XQUERY provides some builtin functions prefixed by “fn”. fn:tokenize is equivalent to ora:tokenize except that it doesn’t break when the delimiter is absent from the input string: it just returns the string. Marc says fn:tokenize is supported, and it does work in 12.1.0.2, but the official documentation says “not supported”. I have asked Marc for more information.

This is not a big deal. With fn:tokenize, there would simply be a bit less typing.

Escaping the delimiter

Todd pointed out that tokenize splits a string based on a regular expression, and some common delimiters (like ‘|’) have special meaning in regular expressions. As a result, I now recommend to simply escape the delimiter with a backslash ‘\’.

PL/SQL

Todd provided a string splitting function in PL/SQL and claimed it is clearly faster than ora:tokenize. He is right!

I wrote a function similar to his and compared it to the “tokenize” solution. Here is the function:

create or replace function string_tokenize(
  p_string in varchar2,
  p_delim in varchar2
)
return sys.odcivarchar2list pipelined
as
  i_prev_pos integer := 1;
  i_pos integer;
  i_max_pos integer := length(p_string) + 1;
  i_delim_length integer := length(p_delim);
begin
  loop
    i_pos := instr(p_string, p_delim, i_prev_pos);
    if i_pos = 0 then
      i_pos := i_max_pos;
    end if;
    pipe row(substr(p_string, i_prev_pos, i_pos - i_prev_pos));
    exit when i_pos = i_max_pos;
    i_prev_pos := i_pos + i_delim_length;
  end loop;
  return;
end string_tokenize;
/

By the way, I tested this function with and without the PRAGMA UDF clause introduced in 12. I found no difference in performance in this case. Here is my final test harness:

set serveroutput on
declare
  l_num number;
  l_timestamp timestamp;
  l_plsql_secs number;
  l_tokenize_secs number;
  l_num_substrings number := 10;
  procedure do(p_sql in varchar2) is
  begin
    dbms_output.put_line(p_sql);
    execute immediate p_sql;
  end do;
begin
  select count(*) into l_num from user_tables where table_name = 'T';
  if l_num > 0 then
    do('drop table t purge');
  end if;
  do('create table t(id number, str varchar2(4000)) cache');
  insert into t
  select level, to_char(level,'fm000000000')||',000000002,000000003,000000004,000000005,000000006,000000007,000000008,000000009,000000010'
  from dual
  connect by level <= 10000;
  commit;
  dbms_output.put_line('Substrings' || chr(9) || 'tokenize' || chr(9) || 'PL/SQL');
  for i in 1..10 loop
    select count(*) into l_num from t;
    
    l_timestamp := localtimestamp;
    select count(column_value) into l_num from (
      select id, column_value from t, table(string_tokenize(str, ','))
    );
    l_plsql_secs := extract(second from localtimestamp - l_timestamp);
    
    l_timestamp := localtimestamp;
    select count(subs) into l_num from (
      select id, subs from t, xmltable(
        'if (contains($X,",")) then ora:tokenize($X,"\,") else $X' 
        passing str as X columns subs varchar2(4000) path '.')
    );
    l_tokenize_secs := extract(second from localtimestamp - l_timestamp);
    dbms_output.put_line(l_num_substrings || chr(9) || l_tokenize_secs || chr(9) || l_plsql_secs);
    update t set str =
    str || ',000000001,000000002,000000003,000000004,000000005,000000006,000000007,000000008,000000009,000000010';
    l_num_substrings := l_num_substrings + 10;
    commit;
  end loop;
end;
/

Notice that I keep the same number of input rows here, whereas in my previous tests I kept the same number of output rows. My “tokenize” solution scales OK, but the PL/SQL function is much faster and scales even better.

In this case a combined SQL + PL/SQL solution beats the best pure SQL solution.

Advertisements

5 thoughts on “Splitting Strings: PL/SQL

  1. Hurray, PL/SQL! :-)

    Great post, thanks for this in-depth analysis, Stew.

    In terms of UDF not making much of a difference, you are only calling it once – for every iteration in the test – within the TABLE operator.

    So I wouldn’t expect you see much of a difference, compared, say, to involving a PL/SQL function the WHERE clause and having it executed thousands of times or more.

    • You’re welcome, Kim!

      I am happy to see that my function is not too different from the one on page 35 of your presentation ;-)

      I didn’t know your presentation was already on slideshare. A good read ahead…

  2. Interesting, as usual.

    Obviously, the regexp stuff used behind the scenes by ora:tokenize does a lot more work than just looking for a simple delimiter.
    It’s still relatively fast compared to other approaches because, internally, the CBO rewrites ora:tokenize to the built-in kernel function SYS_XQTOKENIZE, as part of an optimization process called “XQuery Rewrite”.
    Sometimes, this rewrite is good, sometimes not. It can be disabled via the NO_XML_QUERY_REWRITE hint.
    I’ve run your test case with this additional hint, and that makes tokenize a little faster (at the expense of more PGA used), but still 3x slower than PL/SQL though.

  3. Great demonstration.

    And it should be straightforward to extend the PL/SQL code so that it can handle CSV quote delimited fields, too.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s