Marc Bleron and Todd Hershiser gave me some very valuable feedback through their comments on my recent “Splitting Strings” posts. The big news is: PL/SQL beats SQL!
XQUERY provides some builtin functions prefixed by “fn”.
fn:tokenize is equivalent to
ora:tokenize except that it doesn’t break when the delimiter is absent from the input string: it just returns the string. Marc says
fn:tokenize is supported, and it does work in 22.214.171.124, but the official documentation says “not supported”. I have asked Marc for more information.
This is not a big deal. With
fn:tokenize, there would simply be a bit less typing.
Escaping the delimiter
Todd pointed out that tokenize splits a string based on a regular expression, and some common delimiters (like ‘|’) have special meaning in regular expressions. As a result, I now recommend to simply escape the delimiter with a backslash ‘\’.
Todd provided a string splitting function in PL/SQL and claimed it is clearly faster than
ora:tokenize. He is right!
I wrote a function similar to his and compared it to the “tokenize” solution. Here is the function:
create or replace function string_tokenize( p_string in varchar2, p_delim in varchar2 ) return sys.odcivarchar2list pipelined as i_prev_pos integer := 1; i_pos integer; i_max_pos integer := length(p_string) + 1; i_delim_length integer := length(p_delim); begin loop i_pos := instr(p_string, p_delim, i_prev_pos); if i_pos = 0 then i_pos := i_max_pos; end if; pipe row(substr(p_string, i_prev_pos, i_pos - i_prev_pos)); exit when i_pos = i_max_pos; i_prev_pos := i_pos + i_delim_length; end loop; return; end string_tokenize; /
By the way, I tested this function with and without the
PRAGMA UDF clause introduced in 12. I found no difference in performance in this case. Here is my final test harness:
set serveroutput on declare l_num number; l_timestamp timestamp; l_plsql_secs number; l_tokenize_secs number; l_num_substrings number := 10; procedure do(p_sql in varchar2) is begin dbms_output.put_line(p_sql); execute immediate p_sql; end do; begin select count(*) into l_num from user_tables where table_name = 'T'; if l_num > 0 then do('drop table t purge'); end if; do('create table t(id number, str varchar2(4000)) cache'); insert into t select level, to_char(level,'fm000000000')||',000000002,000000003,000000004,000000005,000000006,000000007,000000008,000000009,000000010' from dual connect by level <= 10000; commit; dbms_output.put_line('Substrings' || chr(9) || 'tokenize' || chr(9) || 'PL/SQL'); for i in 1..10 loop select count(*) into l_num from t; l_timestamp := localtimestamp; select count(column_value) into l_num from ( select id, column_value from t, table(string_tokenize(str, ',')) ); l_plsql_secs := extract(second from localtimestamp - l_timestamp); l_timestamp := localtimestamp; select count(subs) into l_num from ( select id, subs from t, xmltable( 'if (contains($X,",")) then ora:tokenize($X,"\,") else $X' passing str as X columns subs varchar2(4000) path '.') ); l_tokenize_secs := extract(second from localtimestamp - l_timestamp); dbms_output.put_line(l_num_substrings || chr(9) || l_tokenize_secs || chr(9) || l_plsql_secs); update t set str = str || ',000000001,000000002,000000003,000000004,000000005,000000006,000000007,000000008,000000009,000000010'; l_num_substrings := l_num_substrings + 10; commit; end loop; end; /
Notice that I keep the same number of input rows here, whereas in my previous tests I kept the same number of output rows. My “tokenize” solution scales OK, but the PL/SQL function is much faster and scales even better.
In this case a combined SQL + PL/SQL solution beats the best pure SQL solution.