Read CLOBs fast with less memory


Reading a big CLOB is like trying to eat a burger all at once: it will take you forever if you don’t choke. Why not cut that CLOB into bite-size chunks? It’s faster, uses less memory – and it’s good table manners…

Marc Bleron has blogged about parsing CSV data in CLOBs. As the CLOB got bigger, the parsing time went way up. I decided to write a pipelined table function that would return the CLOB in VARCHAR2-sized bites, cutting at record boundaries.

  • By default, the maximum bite size is 4000 bytes. You can make it less if you need some room to use REPLACE.
  • Also by default, the record delimiter is the NEWLINE of your operating system, but you can change it.

Again, the intent is to return as many entire records as will fit in the “bite”. If the input has a record that is longer than the maximum bite size, then this function will raise an exception.

One of the pains of CLOBs is that LENGTH() and SUBSTR() deal with characters only: the LENGTHB() and SUBSTRB() functions deal with bytes, but they are limited to VARCHAR2. Fortunately, VARCHAR2 in PL/SQL can be much longer, so I read 4000 characters into a buffer and then I cut off any records that overflow the 4000-byte boundary.

UPDATE 2018-06-15: I renamed “rec_term” to “rec_delim” because external table definitions say “records delimited by”. I now raise an exception if a bite does not contain the record delimiter.

UPDATE 2018-06-16 and -17: bug fixes for edge cases.

UPDATE 2018-06-28: better error handling. The maximum record length (in bytes) was actually 4000 including the record delimiter. It is now 4000 excluding the record delimiter.

create or replace function pipe_clob (
  p_clob in clob,
  p_max_lengthb in integer default 4000,
  p_rec_delim in varchar2 default '
'
) return sys.odcivarchar2list pipelined authid current_user as
/*
Break CLOB into VARCHAR2 sized bites.
Reduce p_max_lengthb if you need to expand the VARCHAR2
in later processing.
Last record delimiter in each bite is not returned,
but if it is a newline and the output is spooled
the newline will come back in the spooled output.
Note: this cannot work if the CLOB contains more than
<p_max_lengthb> consecutive bytes without a record delimiter.
*/
  l_amount integer;
  l_offset integer;
  l_buffer varchar2(32767 BYTE);
  l_out varchar2(32767 BYTE);
  l_buff_lengthb integer;
  l_occurence integer;
  l_rec_delim_length integer := length(p_rec_delim);
begin
  if p_max_lengthb > 4000 then
    raise_application_error(-20001, 'Maximum record length (p_max_lengthb) cannot be greater than 4000.');
  elsif p_max_lengthb < 10 then
    raise_application_error(-20002, 'Maximum record length (p_max_lengthb) cannot be less than 10.');
  end if;
  if p_rec_delim is null then
    raise_application_error(-20003, 'Record delimiter (p_rec_delim) cannot be null.');
  end if;
  /* This version is limited to 4000 byte output, so I can afford to ask for 4001
      in case the record is exactly 4000 bytes long.
  */
  l_amount := p_max_lengthb + l_rec_delim_length;
  l_offset := 1;
  while l_amount = p_max_lengthb + l_rec_delim_length loop
    begin
      DBMS_LOB.READ (
       p_clob,
       l_amount,
       l_offset,
       l_buffer
      );
    exception when no_data_found then
      l_amount := 0;
    end;
    if l_amount = 0 then
      exit;
    elsif lengthb(l_buffer) <= p_max_lengthb then pipe row(rtrim(l_buffer, p_rec_delim)); exit; end if; l_buff_lengthb := p_max_lengthb + l_rec_delim_length; l_occurence := 0; while l_buff_lengthb > p_max_lengthb loop
      l_occurence := l_occurence + 1;
      l_buff_lengthb := instrb(l_buffer,p_rec_delim, -1, l_occurence) - 1;
    end loop;
    if l_buff_lengthb < 0 then
      if l_amount = p_max_lengthb + l_rec_delim_length then
        raise_application_error(
          -20004, 
          'Input clob at offset '||l_offset||' for lengthb '||p_max_lengthb||' has no record delimiter'
        );
      end if;
    end if;
    l_out := substrb(l_buffer, 1, l_buff_lengthb);
    pipe row(l_out);
    l_offset := l_offset + nvl(length(l_out),0) + l_rec_delim_length;
    l_amount := p_max_lengthb + l_rec_delim_length;
  end loop;
  return;
end;
/

In my laptop tests, read time increased linearly with the number of records. If you try it, let me know how it works out.

Advertisements

2 thoughts on “Read CLOBs fast with less memory

  1. Hello Stew,

    It looks to me that the pipe_clob function has a problem if it encounters records longer than 4000 characters.

    A small example:

    create table t (c clob);

    declare
    x clob := chr(55184);
    begin
    dbms_output.put_line( LENGTHB(CHR(55184)) );

    insert into t (c)
    values ( rpad(x,5000,x) || chr(10) ||
    rpad(x,5000,x) || chr(10) ) ;

    end;
    /
    2

    select length(c) from t
    /

    LENGTH(C)
    ———
    10002

    select column_value
    from table (pipe_clob( (select c from t), 4000, chr(10)))
    /

    ORA-06502: PL/SQL: numeric or value error: character string buffer too small

    select length(column_value)
    from table (pipe_clob( (select c from t), 4000, chr(10)))
    /

    ORA-06502: PL/SQL: numeric or value error: character string buffer too small

    I think that the problem comes from this assignment:

    if l_buff_lengthb = 0 then
    l_buff_lengthb := lengthb(l_buffer);
    end if;

    When this happens, l_buff_lengthb can become much longer than the maximum of 4000 bytes,
    if l_buffer contains 4000 multi-byte characters, and then the assignment to l_out produces the error.

    Though not so usual in real-life, such long records will probably have to be truncated to 4000 bytes,
    if you want to keep the function compatible with all database versions.

    Thanks a lot & Best Regards,
    Iudith Mentzel

    • Hi Iudith,

      Thanks for your comment. You made me realize that my post was not sufficiently clear: PIPE_CLOB must “cut at record boundaries”. If a record is longer that 4000 bytes, this function cannot and should not work. I will try to improve the text.

      You also made me realize that my error handling was insufficient. I have tried to improve it.

      Best regards,
      Stew

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s