Read CLOBs fast with less memory


Reading a big CLOB all at once is like trying to eat a burger in one bite: it will take you forever, if you don't choke first. Why not cut that CLOB into bite-size chunks? It's faster, uses less memory, and it's good table manners…

Marc Bleron has blogged about parsing CSV data stored in CLOBs. As the CLOB got bigger, the parsing time went up dramatically. So I decided to write a pipelined table function that returns the CLOB in VARCHAR2-sized bites, cutting each bite at a record boundary so that it contains only complete records.

  • By default, the maximum bite size is 4000 bytes. You can make it smaller if you need to leave room for each bite to grow, for example when you apply REPLACE() to it.
  • Also by default, the record delimiter is the newline of your operating system, but you can change it. Both parameters are illustrated in the sketches below.
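
To give an idea of the interface before diving into the code, here is a minimal call using all the defaults, against a made-up two-record CSV literal (the function itself is listed further down). Note that each returned row is a bite containing complete records, not a single record:

select rownum as bite#, column_value as bite
from table(pipe_clob(to_clob(
  'ename,sal' || chr(10) ||
  'SCOTT,3000' || chr(10)
)));

Since both records fit in one 4000-byte bite, this returns a single row with an embedded newline.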

One of the pains of CLOBs is that LENGTH() and SUBSTR() deal only in characters, while the byte-oriented LENGTHB() and SUBSTRB() functions are limited to VARCHAR2. Fortunately, a VARCHAR2 variable in PL/SQL can hold up to 32767 bytes, so I read 4000 characters into a buffer, which may occupy more than 4000 bytes in a multi-byte character set, then cut the buffer back to the last record delimiter that starts within the 4000-byte limit.
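
To see why the distinction matters, here is a quick illustration; assuming an AL32UTF8 database, the accented characters take two bytes each:

select length('déjà vu') as num_chars,
       lengthb('déjà vu') as num_bytes
from dual;

-- NUM_CHARS  NUM_BYTES
-- ---------  ---------
--         7          9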

UPDATE 2018-06-15: I renamed “rec_term” to “rec_delim”, since external table definitions say “records delimited by”. I also now raise an exception if a full-sized bite does not contain the record delimiter.

UPDATE 2018-06-16 and -17: bug fixes for edge cases.

create or replace function pipe_clob (
  p_clob in clob,
  p_max_lengthb in integer default 4000,
  p_rec_delim in varchar2 default '
'
) return sys.odcivarchar2list pipelined authid current_user as
  l_amount integer;
  l_offset integer;
  l_buffer varchar2(4000 CHAR);
  l_out varchar2(4000 BYTE);
  l_buff_lengthb integer;
  l_occurrence integer;
  l_rec_delim_length integer := length(p_rec_delim);
begin
  l_amount := p_max_lengthb;
  l_offset := 1;
  loop
    -- read the next chunk; on return, l_amount holds
    -- the number of characters actually read
    begin
      DBMS_LOB.READ (
       p_clob,
       l_amount,
       l_offset,
       l_buffer
      );
    exception when no_data_found then
      l_amount := 0;
    end;
    if l_amount = 0 then
      -- nothing left to read
      exit;
    elsif lengthb(l_buffer) < p_max_lengthb then
      -- short final chunk: pipe it whole, minus any trailing delimiter
      pipe row(rtrim(l_buffer, p_rec_delim));
      exit;
    end if;
    -- search backward for the last delimiter that starts
    -- within the byte limit
    l_buff_lengthb := p_max_lengthb + 1;
    l_occurrence := 0;
    while l_buff_lengthb > p_max_lengthb loop
      l_occurrence := l_occurrence + 1;
      l_buff_lengthb := instrb(l_buffer, p_rec_delim, -1, l_occurrence);
    end loop;
    if l_buff_lengthb = 0 then
      if l_amount = p_max_lengthb then
        -- a full-sized bite with no delimiter: give up
        raise_application_error(
          -20001,
          'Input clob at offset '||l_offset||' for length '||l_amount||' has no record delimiter'
        );
      else
        -- final chunk without a trailing delimiter: keep all of it
        l_buff_lengthb := lengthb(l_buffer) + 1;
      end if;
    end if;
    -- pipe everything up to, but not including, the delimiter
    l_out := substrb(l_buffer, 1, l_buff_lengthb - 1);
    pipe row(l_out);
    -- move past the piped characters and the delimiter,
    -- then reset the amount to request from the next read
    l_offset := l_offset + nvl(length(l_out),0) + l_rec_delim_length;
    l_amount := p_max_lengthb;
  end loop;
  return;
end;
/
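
Here is a sketch that overrides both defaults, again on a made-up CSV literal: a smaller byte limit leaves each bite some room to grow through REPLACE(), and the record delimiter becomes a Windows-style CRLF:

select replace(column_value, ',', ' ; ') as bite
from table(pipe_clob(
  to_clob('a,b' || chr(13) || chr(10) || 'c,d' || chr(13) || chr(10)),
  3900,
  chr(13) || chr(10)
));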

In my laptop tests, read time increased linearly with the number of records. If you try it, let me know how it works out.
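
If you want to try it yourself, here is the kind of harness you might use, as a sketch with made-up record contents: build a temporary CLOB with DBMS_LOB, then count the bites that come back.

set serveroutput on

declare
  l_clob clob;
  l_rec varchar2(100) := rpad('x', 99) || chr(10);
  l_bites integer;
begin
  dbms_lob.createtemporary(l_clob, true);
  -- 100,000 records of 100 characters each
  for i in 1 .. 100000 loop
    dbms_lob.writeappend(l_clob, length(l_rec), l_rec);
  end loop;
  select count(*) into l_bites
  from table(pipe_clob(l_clob));
  dbms_output.put_line(l_bites || ' bites piped');
  dbms_lob.freetemporary(l_clob);
end;
/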
