Reading a big CLOB is like trying to eat a burger all at once: it will take you forever, if you don’t choke first. Why not cut that CLOB into bite-size chunks? It’s faster, uses less memory – and it’s good table manners…
Marc Bleron has blogged about parsing CSV data in CLOBs. As the CLOB got bigger, the parsing time went way up. I decided to write a pipelined table function that would return the CLOB in VARCHAR2-sized bites, cutting at record boundaries.
- By default, the maximum bite size is 4000 bytes. You can make it less if you need some room to use REPLACE.
- Also by default, the record delimiter is the NEWLINE of your operating system, but you can change it.
In all cases, the intent is to return as many whole records as will fit in the “bite”. If the input contains a single record longer than the maximum bite size, the function raises an exception.
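The chunking idea can be sketched in Python (a sketch only – the function and parameter names are hypothetical, and Python byte strings stand in for Oracle’s byte-oriented functions):

```python
def pipe_clob(clob: str, max_lengthb: int = 4000, rec_delim: str = "\n"):
    """Yield 'bites' of at most max_lengthb bytes, cut at record delimiters.

    Mimics the behavior described above: the trailing delimiter of each
    bite is dropped, and a record longer than max_lengthb bytes is an error.
    """
    data = clob.encode("utf-8")
    delim = rec_delim.encode("utf-8")
    pos = 0
    while pos < len(data):
        # Read slightly more than one bite, in case a record ends
        # exactly on the max_lengthb boundary.
        window = data[pos:pos + max_lengthb + len(delim)]
        if len(window) <= max_lengthb:
            # Last bite: trim trailing delimiters and stop.
            yield window.rstrip(delim).decode("utf-8")
            return
        # Last delimiter whose preceding content fits in max_lengthb bytes.
        cut = window.rfind(delim)
        if cut < 0:
            raise ValueError(
                f"no record delimiter within {max_lengthb} bytes at offset {pos}"
            )
        yield window[:cut].decode("utf-8")
        pos += cut + len(delim)
```

For example, with an 8-byte limit the input "aaa\nbbb\nccc\n" comes back as two bites, "aaa\nbbb" and "ccc" – each at most 8 bytes, delimiters at bite boundaries removed.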
One of the pains of CLOBs is that LENGTH() and SUBSTR() deal in characters only; the byte-oriented LENGTHB() and SUBSTRB() functions are limited to VARCHAR2. Fortunately, a VARCHAR2 in PL/SQL can hold up to 32767 bytes, so I read the CLOB about 4000 characters at a time into a buffer, then cut off any records that overflow the 4000-byte boundary.
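To see why the character/byte distinction matters, here is a quick Python illustration (a hypothetical helper, not the actual implementation): len() counts characters like LENGTH(), while the encoded length counts bytes like LENGTHB().

```python
def trim_to_byte_limit(buffer: str, max_bytes: int, delim: str = "\n") -> str:
    """Drop trailing records until what remains fits in max_bytes.

    A rough stand-in for the INSTRB/SUBSTRB loop: search backwards for
    a record delimiter until the prefix fits the byte budget.
    """
    raw = buffer.encode("utf-8")
    bdelim = delim.encode("utf-8")
    while len(raw) > max_bytes:
        cut = raw.rfind(bdelim)
        if cut < 0:
            raise ValueError("a single record exceeds max_bytes")
        raw = raw[:cut]
    # Safe to decode: we only ever cut at a delimiter, never mid-character.
    return raw.decode("utf-8")

# A multibyte character makes the two counts diverge:
s = "é" * 3000                           # LENGTH() would say 3000 characters...
assert len(s) == 3000
assert len(s.encode("utf-8")) == 6000    # ...but LENGTHB() would say 6000 bytes
```

The key point is that a 4000-character buffer can occupy far more than 4000 bytes, so the cut must be made on byte positions, not character positions.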
UPDATE 2018-06-15: I renamed “rec_term” to “rec_delim” because external table definitions say “records delimited by”. I now raise an exception if a bite does not contain the record delimiter.
UPDATE 2018-06-16 and -17: bug fixes for edge cases.
UPDATE 2018-06-28: better error handling. The maximum record length (in bytes) was actually 4000 including the record delimiter. It is now 4000 excluding the record delimiter.
create or replace function pipe_clob (
  p_clob in clob,
  p_max_lengthb in integer default 4000,
  p_rec_delim in varchar2 default '
'
) return sys.odcivarchar2list pipelined authid current_user as
/*
Break CLOB into VARCHAR2 sized bites.
Reduce p_max_lengthb if you need to expand the VARCHAR2 in later processing.
Last record delimiter in each bite is not returned,
but if it is a newline and the output is spooled,
the newline will come back in the spooled output.
Note: this cannot work if the CLOB contains more than
<p_max_lengthb> consecutive bytes without a record delimiter.
*/
  l_amount integer;
  l_offset integer;
  l_buffer varchar2(32767 BYTE);
  l_out varchar2(32767 BYTE);
  l_buff_lengthb integer;
  l_occurence integer;
  l_rec_delim_length integer := length(p_rec_delim);
begin
  if p_max_lengthb > 4000 then
    raise_application_error(-20001,
      'Maximum record length (p_max_lengthb) cannot be greater than 4000.');
  elsif p_max_lengthb < 10 then
    raise_application_error(-20002,
      'Maximum record length (p_max_lengthb) cannot be less than 10.');
  end if;
  if p_rec_delim is null then
    raise_application_error(-20003,
      'Record delimiter (p_rec_delim) cannot be null.');
  end if;
  /* This version is limited to 4000 byte output,
     so I can afford to ask for more than 4000 bytes
     in case a record ends exactly at the 4000-byte boundary. */
  l_amount := p_max_lengthb + l_rec_delim_length;
  l_offset := 1;
  while l_amount = p_max_lengthb + l_rec_delim_length loop
    begin
      DBMS_LOB.READ(p_clob, l_amount, l_offset, l_buffer);
    exception when no_data_found then
      l_amount := 0;
    end;
    if l_amount = 0 then
      exit;
    elsif lengthb(l_buffer) <= p_max_lengthb then
      pipe row(rtrim(l_buffer, p_rec_delim));
      exit;
    end if;
    l_buff_lengthb := p_max_lengthb + l_rec_delim_length;
    l_occurence := 0;
    while l_buff_lengthb > p_max_lengthb loop
      l_occurence := l_occurence + 1;
      l_buff_lengthb := instrb(l_buffer, p_rec_delim, -1, l_occurence) - 1;
    end loop;
    if l_buff_lengthb < 0 then
      if l_amount = p_max_lengthb + l_rec_delim_length then
        raise_application_error(
          -20004,
          'Input clob at offset '||l_offset||' for lengthb '||p_max_lengthb||
          ' has no record delimiter'
        );
      end if;
    end if;
    l_out := substrb(l_buffer, 1, l_buff_lengthb);
    pipe row(l_out);
    l_offset := l_offset + nvl(length(l_out),0) + l_rec_delim_length;
    l_amount := p_max_lengthb + l_rec_delim_length;
  end loop;
  return;
end;
/
In my laptop tests, read time increased linearly with the number of records. If you try it, let me know how it works out.