Extract CSV from CLOB with JSON arrays

Marc Bleron blogged about CSV CLOBs and JSON_TABLE two years ago. Here’s my contribution to improve on a great idea.

“Simple” CSV

The Oracle Database does a fine job of parsing CSV data in flat files, using the External Tables facility. Unfortunately, this service is not available for CSV data in VARCHARs or CLOBs. Marc showed that JSON_TABLE (and XML_TABLE) can parse “simple” CSV if it is reformatted. What is “simple”?

CSV data consists of records and fields within records.

  • Records are delimited by NEWLINE (or some other string).
  • Fields are terminated by commas (or some other string),
  • If necessary, some or all fields can be enclosed by double quotes " (or some other string).

When Marc says “simple”, he means that fields are never enclosed. This is important, because enclosed fields may contain the double quote (provided it is present twice in a row) and / or the comma. With “simple” CSV, we know that all commas are true field terminators and we don’t have to replace "" with " .

“Simple” also means that there is no trimming of whitespace: you get everything between the commas.

Finally, Marc assumes there is no terminator after the last field of the record, even though Oracle allows it.

So, “simple” CSV has delimited records with terminated fields that are never enclosed. There is no trimming of whitespace and the last field in the record is not terminated.

My contribution

  • First of all, I break the CLOB into VARCHAR2 bites using the pipelined table function PIPE_CLOB (as explained in my previous post).
  • Then I remove any field terminator that immediately precedes a record delimiter.
  • Then I use JSON_ARRAY over the entire VARCHAR2 in case some characters need to be escaped.
  • Then I do several REPLACES such that:
    • each record becomes a JSON array of string values, and
    • those arrays are included in one overall array.
  • Finally, I use JSON_TABLE to break the overall array into rows and the inner arrays into columns.

Note that everything before the COLUMNS clause in JSON_TABLE is generic, because the inner arrays can contain any number of elements.

To demonstrate, here is a CLOB containing data from the EMP table, with a trailing comma added:

7369,SMITH,CLERK,7902,1980-12-17T00:00:00,800,,20,
7499,ALLEN,SALESMAN,7698,1981-02-20T00:00:00,1600,300,30,
7521,WARD,SALESMAN,7698,1981-02-22T00:00:00,1250,500,30,
7566,JONES,MANAGER,7839,1981-04-02T00:00:00,2975,,20,
7654,MARTIN,SALESMAN,7698,1981-09-28T00:00:00,1250,1400,30,
7698,BLAKE,MANAGER,7839,1981-05-01T00:00:00,2850,,30,
7782,CLARK,MANAGER,7839,1981-06-09T00:00:00,2450,,10,
7839,KING,PRESIDENT,,1981-11-17T00:00:00,5000,,10,
7844,TURNER,SALESMAN,7698,1981-09-08T00:00:00,1500,0,30,
7876,ADAMS,CLERK,7788,1987-05-23T00:00:00,1100,,20,
7900,JAMES,CLERK,7698,1981-12-03T00:00:00,950,,30,
7902,FORD,ANALYST,7566,1981-12-03T00:00:00,3000,,20,
7934,MILLER,CLERK,7782,1982-01-23T00:00:00,1300,,10,

And the code:

with last_term_removed as (
  select replace(column_value, ','||chr(10), chr(10)) str
  from table(pipe_clob((select c from t), 3500))
)
, json_data as (
  select '[' ||
    replace (
      replace(json_array(str), ',', '","'),
      '\n',
      '"],["'
    )
    || ']' jstr  
  from last_term_removed
)
select sql_data.*
from json_data j, json_table(
   j.jstr, '$[*]'
   columns empno    number        path '$[0]'
         , ename    varchar2(128) path '$[1]'
         , job      varchar2(128) path '$[2]'
         , mgr      number        path '$[3]'
         , hiredate date          path '$[4]'
         , sal      number        path '$[5]'
         , comm     number        path '$[6]'
         , deptno   number        path '$[7]'
) sql_data;
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
7369 SMITH CLERK 7902 1980-12-17 800 20
7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30
7521 WARD SALESMAN 7698 1981-02-22 1250 500 30
7566 JONES MANAGER 7839 1981-04-02 2975 20
7654 MARTIN SALESMAN 7698 1981-09-28 1250 1400 30
7698 BLAKE MANAGER 7839 1981-05-01 2850 30
7782 CLARK MANAGER 7839 1981-06-09 2450 10
7839 KING PRESIDENT 1981-11-17 5000 10
7844 TURNER SALESMAN 7698 1981-09-08 1500 0 30
7876 ADAMS CLERK 7788 1987-05-23 1100 20
7900 JAMES CLERK 7698 1981-12-03 950 30
7902 FORD ANALYST 7566 1981-12-03 3000 20
7934 MILLER CLERK 7782 1982-01-23 1300 10

 

Scalability: yes!

When Marc tested reading CLOBs directly, performance went bad as the CLOB increased in size:

Rows Seconds
91336 2.2
182672 4.3
365344 9.4
730688 22
1461376 840

 

In my tests with very similar data, the number of rows per second remains about the same:

LINES SECS LINES_PER_SEC AVG_LINES_PER_SEC PCT_DIFF
91336 1.374 66475 68517 -3
182672 2.6 70258 68517 2.5
365344 5.35 68289 68517 -.3
730688 10 73069 68517 6.6
1461376 22 66426 68517 -3.1

 

Advertisements

2 thoughts on “Extract CSV from CLOB with JSON arrays

  1. Stew,

    I had a real-life problem last year where I had to parse CSV data in a CLOB. In my case, however, the CSV data was on an 11g database and included quoted values that contained field and record delimiters as well as duplicated quotes. I ended up using regular expressions and posted a blog about it

    https://tonyhasler.wordpress.com/2017/07/02/csv-parsing-and-tokenizing-strings/

    Sometime after posting this I found that parsing CLOBs using regular expressions was pretty fast for temporary clobs and cached permanent clobs. But any type of regular expression parsing of non-cached permanent CLOBs was horrendous because of the repeated direct path reads as the parser moved up and down the CLOB.

    Obviously, you can workaround the issue by taking a copy of a permanent CLOB before parsing it.

    I am curious as to how the performance of your solution varies depending on CLOB type.

    • Hi Tony,

      If you refer to the little table at the end of my post:

      I ran the tests with the same data, but with LOB(C) STORE AS (CACHE). I also did a preliminary query to put as much of the CLOB as possible in the cache.

      For the first three tests (smaller CLOBs), with CACHE the test ran 30% to 33% faster.

      Once I got to the fourth test, run time was practically identical. I assume this is due to the limited size of my buffer cache.

      Best regards,
      Stew

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s