Splitting strings before they’re CLOBs!

After I tokenized two CLOBs in response to a forum question, Andy Sayer mentioned that the data had probably been in files to begin with. I smacked my forehead and turned to external tables. The light bulb went on when I realized the comma needed to be a record delimiter, not a field separator. Easy.

create table item_ids_EXT(  
  "RN" NUMBER,  
  "ITEM_ID" VARCHAR2(16)  
)  
ORGANIZATION EXTERNAL(  
  TYPE ORACLE_LOADER  
  DEFAULT DIRECTORY EXT_DATA_DIR  
  ACCESS PARAMETERS(  
    records delimited by ','   
    BADFILE EXT_DATA_DIR:'i_item_id.txt.bad'  
    LOGFILE EXT_DATA_DIR:'i_item_id.txt.log'  
    FIELDS TERMINATED BY "§" NOTRIM  
    MISSING FIELD VALUES ARE NULL  
    REJECT ROWS WITH ALL NULL FIELDS  
    (  
      "RN" recnum,  
      "ITEM_ID" char(255)  
    )  
  )  
  location ('i_item_id.txt')  
)  
REJECT LIMIT 10;

create table item_keys_EXT(  
  "RN" NUMBER,  
  "ITEM_KEY" VARCHAR2(255)  
)  
ORGANIZATION EXTERNAL(  
  TYPE ORACLE_LOADER  
  DEFAULT DIRECTORY EXT_DATA_DIR  
  ACCESS PARAMETERS(  
    records delimited by ','   
    BADFILE EXT_DATA_DIR:'i_item_key.txt.bad'  
    LOGFILE EXT_DATA_DIR:'i_item_key.txt.log'  
    FIELDS TERMINATED BY "§" NOTRIM  
    MISSING FIELD VALUES ARE NULL  
    REJECT ROWS WITH ALL NULL FIELDS  
    (  
      "RN" recnum,  
      "ITEM_KEY" char(255)  
    )  
  )  
  location ('i_item_key.txt')  
)  
REJECT LIMIT 10;  

set timing on

create table u as  
select item_id, item_key  
from item_ids_ext  
join item_keys_ext using(rn);

Table U created.

Elapsed: 00:00:00.787

select floor(2 * 60 * 60 / 0.787) times_faster from dual;

TIMES_FASTER
------------
        9148

Splitting strings when they are CLOBs

Like many of you, I have enjoyed using SQL to make data processing go hundreds of times faster. My new record is: 8000 times faster :-)

In a recent ODC thread, the question was how to split out comma separated values from two CLOBs, one containing “item ids” and the other “item keys”. The ids and the keys had to be matched by position: the first id with the first key, the second id with the second key, and so on. Then each id/key pair was to be inserted as a row into a target table.

With 50,000 ids of 7 characters and 50,000 keys of 172 characters, the current processing takes 2 hours.

Test data

First I’ll create a table with the two CLOBs.

--drop table t purge;
create table t as
select 
SUBSTR(
  xmlcast(
    xmlagg(
      xmlelement(E, ',' || to_char(level,'fm0000009'))
      order by level
    ) AS CLOB
  ), 2
) i_item_id,
SUBSTR(
  xmlcast(
    xmlagg(
      xmlelement(E, ',' || rpad(to_char(level,'fm0000009'),172,'*'))
      order by level
    ) AS CLOB
  ), 2
) i_item_key
from dual
connect by level <= 50000;
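
As a quick sanity check (an added example, not part of the original post), the generated CLOBs should come out to 399,999 characters for the ids (50,000 entries of 7 digits plus a comma, minus the stripped leading comma) and 8,649,999 characters for the keys (50,000 entries of 172 characters plus a comma, minus the leading comma):

select length(i_item_id) id_len, length(i_item_key) key_len
from t;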

Splitting CLOB strings

Instead of seeking a “perfect” custom-made solution, I just put together two functions I have already written about:

  1. The PIPE_CLOB function: a pipelined table function that reads a CLOB and outputs up to 4000 bytes at a time, respecting a delimiter.
  2. The STRING_TOKENIZE function: a pipelined table function that splits a VARCHAR2 string according to a delimiter.
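
The PIPE_CLOB code is listed in full at the end of this page. I won't repeat STRING_TOKENIZE here, but a minimal sketch of such a pipelined tokenizer (an illustration under my own assumptions; the function I actually use may differ in signature and edge-case handling) could look like this:

create or replace function string_tokenize(
  p_string in varchar2,
  p_delim in varchar2 default ','
) return sys.odcivarchar2list pipelined authid current_user as
  l_start integer := 1;
  l_pos integer;
begin
  if p_string is null then
    return;
  end if;
  loop
    -- find the next delimiter, starting after the previous token
    l_pos := instr(p_string, p_delim, l_start);
    if l_pos = 0 then
      -- no more delimiters: the rest of the string is the last token
      pipe row(substr(p_string, l_start));
      exit;
    end if;
    pipe row(substr(p_string, l_start, l_pos - l_start));
    l_start := l_pos + length(p_delim);
  end loop;
  return;
end string_tokenize;
/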

All I have to do is feed the output from the first function into the second, and voilà!

select b.column_value item_id
from t,
table(pipe_clob( i_item_id, 4000, ',')) a,
table(string_tokenize(a.column_value,',')) b;

Combining the two CLOBs

Remember, the requirement was to combine the values from both CLOBs by position, in order to insert them into a target table. Since the functions return the values in order, I just use ROWNUM to number the values and use that number to join them.

--drop table u purge;
create table u as
with item_ids as (
  select rownum rn,
    b.column_value item_id
    from t,
  table(pipe_clob( i_item_id, 4000, ',')) a,
  table(string_tokenize(a.column_value,',')) b
)
, item_keys as (
  select rownum rn,
    b.column_value item_key
    from t,
  table(pipe_clob( i_item_key, 4000, ',')) a,
  table(string_tokenize(a.column_value,',')) b
)
select item_id, item_key 
from item_ids join item_keys using(rn);

Table U created.

Elapsed: 00:00:00.879

That’s right, processing time went from two hours to less than one second.

Now, I suspect that the questioner’s code is doing more than just a simple insert, but until he or she provides more details about the requirement, I get to maintain my claim:

8000 times faster!

(which I hope applies only to my code, not to the speed at which time flies…)

CSV from CLOB with field enclosures

After my post about extracting CSV without enclosures from a CLOB, here is my solution for CSV with enclosures. It wasn’t easy…

How Oracle parses CSV

We can parse CSV in files using SQL*Loader, but I prefer External Tables with the ORACLE_LOADER access driver. Suppose an External Table has these parameters:

records delimited by NEWLINE
SKIP 0
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"' NOTRIM
(or FIELDS CSV WITHOUT EMBEDDED NOTRIM in 12.2)
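
For context, here is a minimal sketch of a complete external table definition using those parameters; the directory, file name and column list are assumptions for illustration, not from the original post.

create table csv_ext(
  "COL1" VARCHAR2(30),
  "COL2" VARCHAR2(30),
  "COL3" VARCHAR2(30)
)
ORGANIZATION EXTERNAL(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY EXT_DATA_DIR
  ACCESS PARAMETERS(
    records delimited by NEWLINE
    SKIP 0
    FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"' NOTRIM
    MISSING FIELD VALUES ARE NULL
  )
  location ('sample.csv')
)
REJECT LIMIT UNLIMITED;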

From the documentation and testing, what I see is this:

  1. With these parameters, we cannot embed record delimiters within quotes.
  2. When the field has no enclosures, all characters between commas are output, whitespace or not.
  3. When there are enclosures:
    A. Whitespace is allowed and stripped before and after enclosed fields, even with NOTRIM
    B. field terminators can be embedded between the enclosures
    C. record delimiters cannot be embedded: they end the record
    D. to be enclosed, field enclosures must be doubled: they are undoubled in the output
    E. Whitespace characters are all characters considered whitespace by REGEXP ‘\s’: 9-13 and 32 (space), plus 17 others if the character set is AL32UTF8!
    F. No second enclosure > error
    G. No delimiter between enclosed fields > error
    H. If there is a field delimiter at the end of a record, it ends the field but does not start a new field.

I admit that 3.A. (whitespace OK before and after enclosures) was a surprise to me.
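
To make rules 2, 3.B and 3.D concrete, here is a small made-up record (my own example, not taken from the documentation) and how it parses under these parameters:

"a,b",  c  ,"He said ""hi"""

This yields three fields: a,b (the embedded comma is kept and the enclosures removed), the unenclosed field "  c  " with its surrounding whitespace kept, and He said "hi" with the doubled enclosures undoubled.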

My objective

I want to emulate the parameters listed above but I also want decent performance. For performance reasons, I decided to simplify the rules I follow:

  • 3.A. : no whitespace allowed before or after enclosed fields
  • 3.E. : I don’t care whether a character is whitespace or not.
  • 3.G. : the error is now “second enclosure not followed by a field delimiter”
  • Instead of LOG and BAD files, I log errors and bad input with DBMS_OUTPUT.PUT_LINE
  • I ignore extra newlines in the input, but I don’t attempt to suppress output with all null fields.

The code

This code should probably be written in C, which allows direct access to each character in a string. However, anyone who can install a C program on the Database server can probably copy the CLOB to a file and use External Table functionality directly! To make my PL/SQL as fast as possible, I use a technique from my COBOL and C days: the GOTO. With this type of code, calling subprocedures would measurably increase CPU usage.

create or replace function parse_csv_json_array(
  p_str in varchar2,
  p_log integer default null
) return varchar2 authid current_user is
/*
Objective: minimal service for parsing CSV with enclosures, similar to:
RECORDS DELIMITED BY NEWLINE
SKIP 0
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"' NOTRIM
(or FIELDS CSV WITHOUT EMBEDDED NOTRIM in 12.2)
> EXCEPT: I allow no whitespace between field enclosures and field terminators.

- record delimiter = NEWLINE, field terminator = "," and field enclosure = '"'
  all are hard coded for the moment.
- record delimiter cannot be embedded in enclosed field.
- without enclosures, everything between field terminators and / or record delimiters is output
- with enclosures:
  - no whitespace allowed before or after enclosed fields
  - field terminators can be embedded between the enclosures
  - to be enclosed, field enclosures must be doubled: they are undoubled in the output
  - no second enclosure > 'Ending field enclosure not found'
  - no terminator after enclosed field > error 'Ending field enclosure not followed by field terminator'
  - If there is a field delimiter at the end of a record, it ends the field but does not start a new field.
- Instead of LOG and BAD files, I log errors and bad input with DBMS_OUTPUT.PUT_LINE
- I ignore extra newlines in the input, but I do not try to suppress output with all null fields.
- The input contains multiple records, so create an array of arrays: one inner array per record
*/
  l_out varchar2(4000);
  i_str integer;
  max_i_str integer;
  max_i_rec integer;
  i_end integer;
  i_num_backslash integer;
begin
  if p_str = '[]' then
    return p_str;
  end if;
  i_str := 3;
  max_i_rec := 2 - 2; -- end of "preceding record", counting 2 for length of record delimiter
  max_i_str := length(p_str) - 2;
  l_out := '[["';

  <<start_record>>
  i_end := max_i_rec + 3; -- length of record delimiter + 1
  i_end := instr(p_str, '\n', i_end);
  if i_end = 0 or i_end is null then
    -- end of record and end of input
    max_i_rec := max_i_str;
  else
    max_i_rec := i_end - 1;
    -- found \n at beginning, skip
    if max_i_rec < i_str then
      i_str := i_str + 2; -- length of record delimiter
      goto start_record;
    end if;
    -- check for '\\' before n
    i_num_backslash := 0;
    while substr(p_str, i_end-i_num_backslash-1,1) = '\' loop
      i_num_backslash := i_num_backslash + 1;
    end loop;
    if mod(i_num_backslash,2) = 1 then
      -- false alert, there was '\n' in the input and json_array made it '\\n'
      goto start_record;
    end if;
  end if;

  <<start_field>>
  if substr(p_str, i_str, 2) = '\"' then
    -- enclosures, so must do one character at a time
    i_str := i_str + 2;
    goto during_enclosed;
  else
    -- no enclosures, search for end of field in record
    i_end := instr(substr(p_str, i_str, max_i_rec - i_str + 1)||',', ',');
    l_out := l_out || substr(p_str, i_str, i_end - 1);
    i_str := i_str + i_end;
    if i_str <= max_i_rec + 1 then
      -- end of field, not end of record
      l_out := l_out || '","';
      goto start_field;
    elsif max_i_rec < max_i_str then
      -- last field of record, not last record
      i_str := max_i_rec + 3;
      l_out := l_out || '"],["';
      goto start_record;
    else
      -- last field of record, last record of input
      l_out := l_out || '"]]';
      goto end_input;
    end if;
  end if;

  <<during_enclosed>>
  i_end := instr(p_str, '\"', i_str);
  if i_end = 0 or i_end > max_i_rec then
    dbms_output.put_line('Ending field enclosure not found, input follows:');
    dbms_output.put_line('<'||json_value(p_str, '$[0]')||'>');
    l_out := null;
    goto end_input;
  end if;
  l_out := l_out || substr(p_str, i_str, i_end - i_str);
  i_str := i_end + 2;
  if substr(p_str, i_str, 2) = '\"' then
    l_out := l_out || '\"';
    i_str := i_str + 2;
  elsif substr(p_str, i_str, 1) = ',' then
      l_out := l_out || '","';
      i_str := i_str + 1;
      goto start_field;
  elsif i_str > max_i_str then
    l_out := l_out || '"]]';
    goto end_input;
  elsif i_str > max_i_rec then
    l_out := l_out || '"],["';
    i_str := max_i_rec + 3;
    goto start_record;
  else
    dbms_output.put_line('Ending field enclosure not followed by field terminator, input follows:');
    dbms_output.put_line('<'||json_value(p_str, '$[0]')||'>');
    l_out := null;
    goto end_input;
  end if;
  goto during_enclosed;

  <<end_input>>
  return l_out;
end parse_csv_json_array;
/
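
Here is an assumed usage sketch, not the exact query from my tests: each bite from PIPE_CLOB is wrapped in JSON_ARRAY so that embedded quotes and newlines arrive as \" and \n escapes, parsed into an array of arrays, then exploded with JSON_TABLE. The table name, bite size, column names and JSON paths are mine; choose a bite size small enough that the escaped and re-quoted output stays under the 4000-byte VARCHAR2 limit.

select j.*
from csv_tab,
     table(pipe_clob(csv_tab.csv_clob, 3000)) bites,
     json_table(
       parse_csv_json_array(json_array(bites.column_value)),
       '$[*]' columns(
         rn   for ordinality,
         col1 varchar2(128) path '$[0]',
         col2 varchar2(128) path '$[1]',
         col3 varchar2(128) path '$[2]'
       )
     ) j;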

Test results

I took five columns from DBA_OBJECTS and added one VARCHAR2(100) column and one NUMBER(4) column. I tested with four sizes of CLOB: 100,000, 200,000, 400,000 and 800,000 records, comparing input with no enclosures to input with all fields enclosed.

Compared to the “simple” CSV solution, this solution is about 50% to 70% slower, depending on how many fields are enclosed. However, the number of records extracted per second remains stable as volume increases.

Please let me know if this solution is of practical interest to anyone…

Read CLOBs fast with less memory


Reading a big CLOB is like trying to eat a burger all at once: it will take you forever if you don’t choke. Why not cut that CLOB into bite-size chunks? It’s faster, uses less memory – and it’s good table manners…

Marc Bleron has blogged about parsing CSV data in CLOBs. As the CLOB got bigger, the parsing time went way up. I decided to write a pipelined table function that would return the CLOB in VARCHAR2-sized bites, cutting at record boundaries.

  • By default, the maximum bite size is 4000 bytes. You can make it less if you need some room to use REPLACE.
  • Also by default, the record delimiter is the NEWLINE of your operating system, but you can change it.

Again, the intent is to return as many entire records as will fit in the “bite”. If the input has a record that is longer than the maximum bite size, then this function will raise an exception.

One of the pains of CLOBs is that LENGTH() and SUBSTR() deal with characters only: the LENGTHB() and SUBSTRB() functions deal with bytes, but they are limited to VARCHAR2. Fortunately, VARCHAR2 in PL/SQL can be much longer, so I read 4000 characters into a buffer and then I cut off any records that overflow the 4000-byte boundary.
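
To see why the byte-boundary trimming matters, a quick illustration (my own example, assuming an AL32UTF8 character set where 'é' takes two bytes):

select length('ééé') char_len, lengthb('ééé') byte_len from dual;

  CHAR_LEN   BYTE_LEN
---------- ----------
         3          6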

UPDATE 2018-06-15: I renamed “rec_term” to “rec_delim” because external table definitions say “records delimited by”. I now raise an exception if a bite does not contain the record delimiter.

UPDATE 2018-06-16 and -17: bug fixes for edge cases.

UPDATE 2018-06-28: better error handling. The maximum record length (in bytes) was actually 4000 including the record delimiter. It is now 4000 excluding the record delimiter.

create or replace function pipe_clob (
  p_clob in clob,
  p_max_lengthb in integer default 4000,
  p_rec_delim in varchar2 default '
'
) return sys.odcivarchar2list pipelined authid current_user as
/*
Break CLOB into VARCHAR2 sized bites.
Reduce p_max_lengthb if you need to expand the VARCHAR2
in later processing.
Last record delimiter in each bite is not returned,
but if it is a newline and the output is spooled
the newline will come back in the spooled output.
Note: this cannot work if the CLOB contains more than
<p_max_lengthb> consecutive bytes without a record delimiter.
*/
  l_amount integer;
  l_offset integer;
  l_buffer varchar2(32767 BYTE);
  l_out varchar2(32767 BYTE);
  l_buff_lengthb integer;
  l_occurence integer;
  l_rec_delim_length integer := length(p_rec_delim);
begin
  if p_max_lengthb > 4000 then
    raise_application_error(-20001, 'Maximum record length (p_max_lengthb) cannot be greater than 4000.');
  elsif p_max_lengthb < 10 then
    raise_application_error(-20002, 'Maximum record length (p_max_lengthb) cannot be less than 10.');
  end if;
  if p_rec_delim is null then
    raise_application_error(-20003, 'Record delimiter (p_rec_delim) cannot be null.');
  end if;
  /* This version is limited to 4000 byte output, so I can afford to ask for 4001
      in case the record is exactly 4000 bytes long.
  */
  l_amount := p_max_lengthb + l_rec_delim_length;
  l_offset := 1;
  while l_amount = p_max_lengthb + l_rec_delim_length loop
    begin
      DBMS_LOB.READ (
       p_clob,
       l_amount,
       l_offset,
       l_buffer
      );
    exception when no_data_found then
      l_amount := 0;
    end;
    if l_amount = 0 then
      exit;
    elsif lengthb(l_buffer) <= p_max_lengthb then
      pipe row(rtrim(l_buffer, p_rec_delim));
      exit;
    end if;
    l_buff_lengthb := p_max_lengthb + l_rec_delim_length;
    l_occurence := 0;
    while l_buff_lengthb > p_max_lengthb loop
      l_occurence := l_occurence + 1;
      l_buff_lengthb := instrb(l_buffer,p_rec_delim, -1, l_occurence) - 1;
    end loop;
    if l_buff_lengthb < 0 then
      if l_amount = p_max_lengthb + l_rec_delim_length then
        raise_application_error(
          -20004, 
          'Input clob at offset '||l_offset||' for lengthb '||p_max_lengthb||' has no record delimiter'
        );
      end if;
    end if;
    l_out := substrb(l_buffer, 1, l_buff_lengthb);
    pipe row(l_out);
    l_offset := l_offset + nvl(length(l_out),0) + l_rec_delim_length;
    l_amount := p_max_lengthb + l_rec_delim_length;
  end loop;
  return;
end;
/
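
As a usage sketch (the table and column names are assumptions), splitting a CLOB column into newline-delimited bites of at most 4000 bytes looks like this:

select rownum bite_number, column_value bite
from big_table, table(pipe_clob(big_table.big_clob, 4000));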

In my laptop tests, read time increased linearly with the number of records. If you try it, let me know how it works out.