Splitting Strings: Proof!

In my previous post I used XMLTABLE and ora:tokenize to split a comma-delimited string. Now I’ll apply that technique to multiple rows and show that it is faster than the alternatives.

Test data

In my tests, I configure the length of each substring, the number of substrings per row, and the total number of rows I should get as output. Each input string is unique because it starts with the row’s ID; this way I avoid any caching that might reduce the number of function calls.

drop table t purge;

create table t cache as
with parms as (
  select 9 str_len, 5 num_subs, 100000 num_rows from dual
)
, str_row as (
  select listagg(n,',') within group(order by n) str
  from (
    select lpad(level+1,str_len,'0') n from parms
    connect by level <= num_subs-1
  )
)
select level id,
  lpad(level,str_len,'0') ||','||str str
from parms, str_row
connect by level <= num_rows/num_subs;

select * from t where id <= 11;
ID STR
1 000000001,000000002,000000003,000000004,000000005
2 000000002,000000002,000000003,000000004,000000005
3 000000003,000000002,000000003,000000004,000000005
4 000000004,000000002,000000003,000000004,000000005
5 000000005,000000002,000000003,000000004,000000005
6 000000006,000000002,000000003,000000004,000000005
7 000000007,000000002,000000003,000000004,000000005
8 000000008,000000002,000000003,000000004,000000005
9 000000009,000000002,000000003,000000004,000000005
10 000000010,000000002,000000003,000000004,000000005
11 000000011,000000002,000000003,000000004,000000005


Notice the CACHE keyword when I create the table. Before my tests, I access the entire table to make sure it is all in the buffer cache.
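The post does not show the warm-up step itself; a minimal sketch (assuming the table fits in the buffer cache) would be any query that touches every row, for example:

```sql
-- Warm-up pass: full scan so the CACHE'd table blocks sit in the buffer cache
-- before timing starts. Any query that reads every row would do.
select count(str) from t;
```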

The “substr+instr” technique

This is the technique from my “New, Improved IN Lists” post. All I need to do is apply it to multiple rows.

One way to do that is to use the 12c LATERAL() clause. If you are not yet on 12c, try
TABLE(CAST(MULTISET(...) AS SYS.ODCIVARCHAR2LIST)) instead.

select a.id, b.subs from t a,
lateral(
  select substr(
    str,
    pos + 1,
    lead(pos,1,4000) over(order by pos) - pos - 1
  ) subs
  from (
    select instr(str, ',', 1, level) pos
    from dual
    connect by
      level <= length(str) - nvl(length(replace(str, ',', '')), 0) + 1
  )
) b;
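For reference, here is a sketch of the pre-12c MULTISET variant mentioned above (this exact query is not in the original post; note that the correlation name “a” must qualify str inside the MULTISET subquery):

```sql
-- Pre-12c alternative to LATERAL(): correlate via MULTISET and cast
-- the result to a collection type so TABLE() can expand it.
select a.id, b.column_value subs
from t a,
table(cast(multiset(
  select substr(
    a.str,
    pos + 1,
    lead(pos,1,4000) over(order by pos) - pos - 1
  )
  from (
    select instr(a.str, ',', 1, level) pos
    from dual
    connect by
      level <= length(a.str) - nvl(length(replace(a.str, ',', '')), 0) + 1
  )
) as sys.odcivarchar2list)) b;
```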

The “tokenize” technique

This one is easy to adapt to multiple rows:

select id, subs from t, xmltable(
  'if (contains($X,",")) then ora:tokenize($X,"\,") else $X'
  passing str as X
  columns subs varchar2(4000) path '.'
);

[Update 2016-08-02: in a comment, Todd Hershiser points out that the second parameter of ora:tokenize is a regular expression. In order to use a regex “metacharacter” like “|” as a delimiter, I need to escape it with a backslash. I decided to put the backslash in everywhere, since it does no harm.

On the other hand, if the delimiter is ‘&’ then this solution cannot be used.]
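As an illustration (hypothetical, since the test data here is comma-delimited): if the input were pipe-delimited, the delimiter would need escaping only in the ora:tokenize argument, not in contains():

```sql
-- Hypothetical pipe-delimited input: '|' is a regex metacharacter,
-- so it must be escaped in the second argument to ora:tokenize.
select id, subs from t, xmltable(
  'if (contains($X,"|")) then ora:tokenize($X,"\|") else $X'
  passing str as X
  columns subs varchar2(4000) path '.'
);
```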

The “regexp_substr” technique

This technique is fairly popular, no doubt because it is concise. For multiple rows, I use the “CONNECT BY ... PRIOR SYS_GUID()” technique that I explained in Chunking tables 7: prior sys_guid().

select id, regexp_substr (str, '[^,]+', 1, level) subs
from t
connect by level <= length (regexp_replace (str, '[^,]+')) + 1
and id = prior id
and prior sys_guid() is not null;

Test Results

As you can see here, the “substr+instr” solution is slightly better than “tokenize” when there are fewer than 8 substrings per string. As the number of substrings increases, the “tokenize” solution scales much better. As for the “regexp_substr” solution, it is on average 20 times slower than “tokenize”.


8 thoughts on “Splitting Strings: Proof!”

  1. Fascinating stuff, Stew! I suggest you add these different approaches to LiveSQL.oracle.com. Also, I’d love to do a PL/SQL quiz offering these variants. I will credit you with the implementations. Does that sound OK?

  2. Interesting comparisons, but I find the tokenize approach to be more than 10x slower than the old standby method (I think from Tom Kyte) that just does string parsing. The tokenize approach also chokes on some of the very common delimiters such as the pipe symbol because of the regex going on under the hood.

    • Todd,

      Thanks for this peer review. Could you post your test code so I can confirm your results? You don’t say how you do multiple rows with the “Tom Kyte” approach.

      Best regards, Stew

      P.S. I have tested the published “Tom Kyte” solution from his blog, using both the LATERAL() technique and the CONNECT BY technique to extend it to multiple rows. It is always a bit slower than the “substr+instr” solution I tested in this post.

      As for the “pipe” and other metacharacters, you are right and I am happy you pointed out that problem. The solution is very simple and I have updated my blog post to include it.

      Thanks again.

  3. My initial testing was iterating through your 20K table and calling a function on each row to take the delimited list and return a SQL type (table of varchar2(255)). This is where I was seeing a 10+x difference.

    Here’s the test iterating through the 100K generated rows using the two approaches (the first is a pipelined function using simple parsing and the second being the tokenize):

    declare
      v_start number;
    begin
      v_start := dbms_utility.get_time();
      for x in (select x.* from t, table(f_string_table2_pipe(t.str)) x) loop
        null;
      end loop;
      dbms_output.put_line( dbms_utility.get_time() - v_start );
      v_start := dbms_utility.get_time();
      for x in (select column_value from t, xmltable('if (contains($X,",")) then ora:tokenize($X,"\,") else $X' passing str as X columns column_value varchar2(4000) path '.')) loop
        null;
      end loop;
      dbms_output.put_line( dbms_utility.get_time() - v_start );
    end;

    where the function is:

    create or replace function f_string_table2_pipe(p_data in varchar2)
    return stringtable pipelined
    as
      v_string long := ltrim(p_data || ',', ',');
      v_pos pls_integer;
    begin
      loop
        v_pos := instr(v_string, ',');
        exit when (nvl(v_pos, 0) = 0);
        pipe row(trim(substr(v_string, 1, v_pos - 1)));
        v_string := substr(v_string, v_pos + 1);
      end loop;
      return;
    end;
    /

    Here, I see a 2-3x performance difference on average between the two with the tokenize consistently taking longer – and it doesn’t even have to do a context switch. The parsing approach does have, as you pointed out, the 4K limitation, although in most cases I wouldn’t see that a blocker.

    Let me know if I’m missing something here or am not comparing correctly.

    In any case, this is interesting to me because we are always having to do this type of delimited value to row transformation and I appreciate your thorough coverage of this topic!

  4. Have you tried a recursive subquery approach? I’d be curious how it compares to your other test cases:

    with r(id, str, lvl, guard, pos1, pos2, subs) as (
      select id
      , str
      , 1 lvl
      , 1
      , instr(str, ',',1,1)+1
      , instr(str, ',',1,2)
      , substr(str,1,instr(str, ',',1,1)-1)
      from t
      union all
      select id, str, lvl+1, pos1, pos2+1, instr(str, ',',pos2+1,1)
      , case when pos1 > pos2 then substr(str,pos1) else substr(str,pos1,pos2-pos1) end
      from r where guard <= pos1
    )
    select id, subs from r;

  5. Pingback: Dimensional Benchmarking of String Splitting SQL | A Programmer Writes… (Brendan's Blog)
