CSV from CLOB with field enclosures

After my post about extracting CSV without enclosures from a CLOB, here is my solution for CSV with enclosures. It wasn’t easy…

How Oracle parses CSV

We can parse CSV in files using SQL*Loader, but I prefer External Tables with the ORACLE_LOADER access driver. Suppose an External Table has these parameters:

records delimited by NEWLINE
SKIP 0
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"' NOTRIM
(or FIELDS CSV WITHOUT EMBEDDED NOTRIM in 12.2)

From the documentation and testing, what I see is this:

  1. With these parameters, we cannot embed record delimiters within quotes.
  2. When the field has no enclosures, all characters between commas are output, whitespace or not.
  3. When there are enclosures:
    1. Whitespace is allowed and stripped before and after enclosed fields, even with NOTRIM
    2. field terminators can be embedded between the enclosures
    3. record delimiters cannot be embedded: they end the record
    4. to be enclosed, field enclosures must be doubled: they are undoubled in the output
    5. Whitespace characters are: all characters considered whitespace by REGEXP ‘\s’
      9-13, 32 (space) – and 17 others if AL32UTF8!
    6. No second enclosure > error
    7. No delimiter between enclosed fields > error
    8. If there is a field delimiter at the end of a record, it ends the field but does not start a new field.

I admit that 3.A. (whitespace OK before and after enclosures) was a surprise to me.

My objective

I want to emulate the parameters listed above but I also want decent performance. For performance reasons, I decided to simplify the rules I follow:

  • 3.A. : no whitespace allowed before or after enclosed fields
  • 3.E. : I don’t care whether a character is whitespace or not.
  • 3.G. : the error is now “second enclosure not followed by a field delimiter”
  • Instead of LOG and BAD files, I log errors and bad input with DBMS_OUTPUT.PUT_LINE
  • I ignore extra newlines in the input, but I don’t attempt to suppress output with all null fields.

The code

This code should probably be written in C, which allows direct access to each character in a string. However, anyone who can install a C program on the Database server can probably copy the CLOB to a file and use External Table functionality directly! To make my PL/SQL as fast as possible, I use a technique from my COBOL and C days: the GOTO. With this type of code, calling subprocedures would measurably increase CPU usage.

create or replace function parse_csv_json_array(
  p_str in varchar2,
  p_log integer default null
) return varchar2 authid current_user is
/*
Objective: minimal service for parsing CSV with enclosures, similar to:
RECORDS DELIMITED BY NEWLINE
SKIP 0
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"' NOTRIM
(or FIELDS CSV WITHOUT EMBEDDED NOTRIM in 12.2)
> EXCEPT: I allow no whitespace between field enclosures and field terminators.

- record delimiter = NEWLINE, field terminator = "," and field enclosure = '"'
  all are hard coded for the moment.
- record delimiter cannot be embedded in enclosed field.
- without enclosures, everything between field terminators and / or record delimiters is output
- with enclosures:
  - no whitespace allowed before or after enclosed fields
  - field terminators can be embedded between the enclosures
  - to be enclosed, field enclosures must be doubled: they are undoubled in the output
  - no second enclosure > 'Ending field enclosure not found'
  - no terminator after enclosed field > error 'Ending field enclosure not followed by field terminator'
  - If there is a field delimiter at the end of a record, it ends the field but does not start a new field.
- Instead of LOG and BAD files, I log errors and bad input with DBMS_OUTPUT.PUT_LINE
- I ignore extra newlines in the input, but I do not try to suppress output with all null fields.
- The input contains multiple records, so create an array of arrays: one inner array per record
*/
  l_out varchar2(4000);
  i_str integer;
  max_i_str integer;
  max_i_rec integer;
  i_end integer;
  i_num_backslash integer;
begin
  if p_str = '[]' then
    return p_str;
  end if;
  i_str := 3;
  max_i_rec := 2 - 2; -- end of "preceding record", counting 2 for length of record delimiter
  max_i_str := length(p_str) - 2;
  l_out := '[["';

  <<start_record>>
  i_end := max_i_rec + 3; -- length of record delimiter + 1
  i_end := instr(p_str, '\n', i_end);
  if i_end = 0 or i_end is null then
    -- end of record and end of input
    max_i_rec := max_i_str;
  else
    max_i_rec := i_end - 1;
    -- found \n at beginning, skip
    if max_i_rec < i_str then
      i_str := i_str + 2; -- length of record delimiter
      goto start_record;
    end if;
    -- check for '\\' before n
    i_num_backslash := 0;
    while substr(p_str, i_end-i_num_backslash-1,1) = '\' loop
      i_num_backslash := i_num_backslash + 1;
    end loop;
    if mod(i_num_backslash,2) = 1 then
      -- false alert, there was '\n' in the input and json_array made it '\\n'
      goto start_record;
    end if;
  end if;

  <<start_field>>
  if substr(p_str, i_str, 2) = '\"' then
    -- enclosures, so must do one character at a time
    i_str := i_str + 2;
    goto during_enclosed;
  else
    -- no enclosures, search for end of field in record
    i_end := instr(substr(p_str, i_str, max_i_rec - i_str + 1)||',', ',');
    l_out := l_out || substr(p_str, i_str, i_end - 1);
    i_str := i_str + i_end;
    if i_str <= max_i_rec + 1 then
      -- end of field, not end of record
      l_out := l_out || '","';
      goto start_field;
    elsif max_i_rec < max_i_str then
      -- last field of record, not last record
      i_str := max_i_rec + 3;
      l_out := l_out || '"],["';
      goto start_record;
    else
      -- last field of record, last record of input
      l_out := l_out || '"]]';
      goto end_input;
    end if;
  end if;

  <<during_enclosed>>
  i_end := instr(p_str, '\"', i_str);
  if i_end = 0 or i_end > max_i_rec then
    dbms_output.put_line('Ending field enclosure not found, input follows:');
    dbms_output.put_line('<'||json_value(p_str, '$[0]')||'>');
    l_out := null;
    goto end_input;
  end if;
  l_out := l_out || substr(p_str, i_str, i_end - i_str);
  i_str := i_end + 2;
  if substr(p_str, i_str, 2) = '\"' then
    l_out := l_out || '\"';
    i_str := i_str + 2;
  elsif substr(p_str, i_str, 1) = ',' then
      l_out := l_out || '","';
      i_str := i_str + 1;
      goto start_field;
  elsif i_str > max_i_str then
    l_out := l_out || '"]]';
    goto end_input;
  elsif i_str > max_i_rec then
    l_out := l_out || '"],["';
    i_str := max_i_rec + 3;
    goto start_record;
  else
    dbms_output.put_line('Ending field enclosure not followed by field terminator, input follows:');
    dbms_output.put_line('<'||json_value(p_str, '$[0]')||'>');
    l_out := null;
    goto end_input;
  end if;
  goto during_enclosed;

  <<end_input>>
  return l_out;
end parse_csv_json_array;
/

Test results

I took five columns from DBA_OBJECTS and added one VARCHAR2(100) and one NUMBER(4). I tested with four sizes of CLOBS: 100,000 records, 200,000 records, 400,000 records and 800,000 records. I compared input with no enclosures and with all fields enclosed.

Compared to the “simple” CSV solution, this solution is about 50% to 70% slower, depending on how many fields are enclosed. However, the number of records extracted per second remains stable as volume increases.

Please let me know if this solution is of practical interest to anyone…

Advertisements

Read CLOBs fast with less memory


Reading a big CLOB is like trying to eat a burger all at once: it will take you forever if you don’t choke. Why not cut that CLOB into bite-size chunks? It’s faster, uses less memory – and it’s good table manners…

Marc Bleron has blogged about parsing CSV data in CLOBs. As the CLOB got bigger, the parsing time went way up. I decided to write a pipelined table function that would return the CLOB in VARCHAR2-sized bites, cutting at record boundaries.

  • By default, the maximum bite size is 4000 bytes. You can make it less if you need some room to use REPLACE.
  • Also by default, the record delimiter is the NEWLINE of your operating system, but you can change it.

One of the pains of CLOBs is that LENGTH() and SUBSTR() deal with characters only: the LENGTHB() and SUBSTRB() functions deal with bytes, but they are limited to VARCHAR2. Fortunately, VARCHAR2 in PL/SQL can be much longer, so I read 4000 characters into a buffer and then I cut off any records that overflow the 4000-byte boundary.

UPDATE 2018-06-15: I renamed “rec_term” to “rec_delim” because external table definitions say “records delimited by”. I now raise an exception if a bite does not contain the record delimiter.

UPDATE 2018-06-16 and -17: bug fixes for edge cases.

create or replace function pipe_clob (
  p_clob in clob,
  p_max_lengthb in integer default 4000,
  p_rec_delim in varchar2 default '
'
) return sys.odcivarchar2list pipelined authid current_user as
  l_amount integer;
  l_offset integer;
  l_buffer varchar2(4000 CHAR);
  l_out varchar2(4000 BYTE);
  l_buff_lengthb integer;
  l_occurence integer;
  l_rec_delim_length integer := length(p_rec_delim);
begin
  l_amount := p_max_lengthb;
  l_offset := 1;
  while l_amount = p_max_lengthb loop
    begin
      DBMS_LOB.READ (
       p_clob,
       l_amount,
       l_offset,
       l_buffer
      );
    exception when no_data_found then
      l_amount := 0;
    end;
    if l_amount = 0 then
      exit;
    elsif lengthb(l_buffer) < p_max_lengthb then
      pipe row(rtrim(l_buffer, p_rec_delim));
      exit;
    end if;
    l_buff_lengthb := p_max_lengthb + 1;
    l_occurence := 0;
    while l_buff_lengthb > p_max_lengthb loop
      l_occurence := l_occurence + 1;
      l_buff_lengthb := instrb(l_buffer,p_rec_delim,-1, l_occurence);
    end loop;
    if l_buff_lengthb = 0 then
      if l_amount = p_max_lengthb then
        raise_application_error(
          -20001, 
          'Input clob at offset '||l_offset||' for length '||l_amount||' has no record delimiter'
        );
      else
        l_buff_lengthb := lengthb(l_buffer)+l_rec_delim_length;
      end if;
    end if;
    l_out := substrb(l_buffer, 1, l_buff_lengthb-l_rec_delim_length);
    pipe row(l_out);
    l_offset := l_offset + nvl(length(l_out),0) + l_rec_delim_length;
    l_amount := p_max_lengthb;
  end loop;
  return;
end;
/

In my laptop tests, read time increased linearly with the number of records. If you try it, let me know how it works out.