Reading a big CLOB is like trying to eat a burger all at once: if you don’t choke first, it will take you forever. Why not cut that CLOB into bite-size chunks? It’s faster, uses less memory – and it’s good table manners…
Marc Bleron has blogged about parsing CSV data in CLOBs. As the CLOB got bigger, the parsing time went way up. I decided to write a pipelined table function that would return the CLOB in VARCHAR2-sized bites, cutting at record boundaries.
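To show the intended calling pattern up front, here is a rough sketch (the table and CLOB column names are made up for illustration; the function itself, `pipe_clob`, follows below). Each row piped out is one VARCHAR2 bite that ends on a record boundary.

```sql
-- Illustration only: my_table and my_clob_col are hypothetical names.
-- Each column_value is one bite of at most 4000 bytes, cut at a record boundary.
select t.column_value as bite
from   my_table m,
       table(pipe_clob(m.my_clob_col)) t;
```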
- By default, the maximum bite size is 4000 bytes. You can make it smaller if you need some room to use REPLACE on each bite.
- Also by default, the record delimiter is the NEWLINE of your operating system, but you can change it – see the example just after this list.
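For instance, a call that leaves headroom for a REPLACE and splits on a pipe character instead of a newline might look like this (again, the table and column names are made up):

```sql
-- Hypothetical call: 3500-byte bites, records delimited by '|'
select t.column_value as bite
from   my_table m,
       table(pipe_clob(m.my_clob_col, p_max_lengthb => 3500, p_rec_delim => '|')) t;
```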
One of the pains of CLOBs is that LENGTH() and SUBSTR() deal only in characters; the LENGTHB() and SUBSTRB() functions deal in bytes, but they are limited to VARCHAR2. Fortunately, a VARCHAR2 in PL/SQL can be much longer than 4000 bytes, so I read 4000 characters into a buffer and then cut off any records that would overflow the 4000-byte boundary.
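To see the difference, assuming an AL32UTF8 database (the character set is my assumption here): LENGTH() counts characters and LENGTHB() counts bytes, so a single multibyte character is enough to push the byte length past the character length.

```sql
-- 'é' is 1 character but 2 bytes in AL32UTF8
select length('é')  as char_count,   -- 1
       lengthb('é') as byte_count    -- 2
from   dual;
```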
UPDATE 2018-06-15: I renamed “rec_term” to “rec_delim” because external table definitions say “records delimited by”. I now raise an exception if a bite does not contain the record delimiter.
UPDATE 2018-06-16 and -17: bug fixes for edge cases.
```sql
create or replace function pipe_clob (
  p_clob in clob,
  p_max_lengthb in integer default 4000,
  -- the default record delimiter is a newline (the literal line break below)
  p_rec_delim in varchar2 default '
'
) return sys.odcivarchar2list pipelined authid current_user as
  l_amount integer;
  l_offset integer;
  l_buffer varchar2(4000 CHAR);
  l_out varchar2(4000 BYTE);
  l_buff_lengthb integer;
  l_occurence integer;
  l_rec_delim_length integer := length(p_rec_delim);
begin
  l_amount := p_max_lengthb;
  l_offset := 1;
  while l_amount = p_max_lengthb loop
    -- read up to p_max_lengthb characters starting at the current offset
    begin
      DBMS_LOB.READ ( p_clob, l_amount, l_offset, l_buffer );
    exception when no_data_found then
      l_amount := 0;
    end;
    if l_amount = 0 then
      exit;
    elsif lengthb(l_buffer) < p_max_lengthb then
      -- last bite: it fits, so pipe it without any trailing delimiter
      pipe row(rtrim(l_buffer, p_rec_delim));
      exit;
    end if;
    -- working backward, find the last delimiter whose byte position
    -- does not exceed p_max_lengthb
    l_buff_lengthb := p_max_lengthb + 1;
    l_occurence := 0;
    while l_buff_lengthb > p_max_lengthb loop
      l_occurence := l_occurence + 1;
      l_buff_lengthb := instrb(l_buffer, p_rec_delim, -1, l_occurence);
    end loop;
    if l_buff_lengthb = 0 then
      if l_amount = p_max_lengthb then
        raise_application_error(
          -20001,
          'Input clob at offset '||l_offset||' for length '||l_amount||' has no record delimiter'
        );
      else
        l_buff_lengthb := lengthb(l_buffer) + l_rec_delim_length;
      end if;
    end if;
    -- pipe the bite without its trailing delimiter, then move the offset past it
    l_out := substrb(l_buffer, 1, l_buff_lengthb - l_rec_delim_length);
    pipe row(l_out);
    l_offset := l_offset + nvl(length(l_out),0) + l_rec_delim_length;
    l_amount := p_max_lengthb;
  end loop;
  return;
end;
/
```
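If you want a quick smoke test before pointing it at real data, something like this anonymous block should do (my own sketch, not part of the original test setup): it builds a small CLOB of newline-delimited records and counts the bites piped back. I pass chr(10) explicitly so the test does not depend on how the newline literal in the default compiled; note the count is the number of bites, not the number of records, since each bite holds many records.

```sql
declare
  l_clob  clob;
  l_bites integer;
begin
  -- build a small test CLOB of newline-delimited records
  for i in 1 .. 1000 loop
    l_clob := l_clob || 'record number ' || i || chr(10);
  end loop;
  -- count the VARCHAR2-sized bites returned by the function
  select count(*) into l_bites
  from   table(pipe_clob(l_clob, p_rec_delim => chr(10)));
  dbms_output.put_line(l_bites || ' bites piped');
end;
/
```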
In my laptop tests, read time increased linearly with the number of records. If you try it, let me know how it works out.