CSV from CLOB with field enclosures

After my post about extracting CSV without enclosures from a CLOB, here is my solution for CSV with enclosures. It wasn’t easy…

How Oracle parses CSV

We can parse CSV in files using SQL*Loader, but I prefer External Tables with the ORACLE_LOADER access driver. Suppose an External Table has these parameters:

records delimited by NEWLINE
SKIP 0
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"' NOTRIM
(or FIELDS CSV WITHOUT EMBEDDED NOTRIM in 12.2)

From the documentation and testing, what I see is this:

  1. With these parameters, we cannot embed record delimiters within quotes.
  2. When a field has no enclosures, all characters between commas are output, whitespace or not.
  3. When there are enclosures:
    A. Whitespace is allowed, and stripped, before and after enclosed fields, even with NOTRIM.
    B. Field terminators can be embedded between the enclosures.
    C. Record delimiters cannot be embedded: they end the record.
    D. To be enclosed, field enclosures must be doubled: they are undoubled in the output.
    E. Whitespace characters are all characters matched by the regular expression ‘\s’: codes 9-13 and 32 (space), plus 17 others if the character set is AL32UTF8!
    F. No second enclosure: error.
    G. No field terminator between enclosed fields: error.
    H. A field terminator at the end of a record ends the field but does not start a new field.

I admit that 3.A. (whitespace OK before and after enclosures) was a surprise to me.
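To see some of these rules in action outside the database: Python's csv module (used here purely as an illustration, not as the ORACLE_LOADER implementation) follows very similar conventions for enclosed fields, and an ASCII-only regular expression shows which character codes ‘\s’ covers.

```python
import csv
import re

# Rule 3.B: field terminators embedded between enclosures are kept.
# Rule 3.D: doubled enclosures are undoubled in the output.
rows = list(csv.reader(['a,"b,c","say ""hi""",d']))
print(rows)  # [['a', 'b,c', 'say "hi"', 'd']]

# Rule 3.E: in ASCII mode, '\s' matches exactly character codes 9-13 and 32.
ascii_ws = [i for i in range(128) if re.match(r'\s', chr(i), re.ASCII)]
print(ascii_ws)  # [9, 10, 11, 12, 13, 32]
```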

My objective

I want to emulate the parameters listed above but I also want decent performance. For performance reasons, I decided to simplify the rules I follow:

  • 3.A.: no whitespace is allowed before or after enclosed fields.
  • 3.E.: I don’t care whether a character is whitespace or not.
  • 3.G.: the error is now “second enclosure not followed by a field terminator”.
  • Instead of LOG and BAD files, I log errors and bad input with DBMS_OUTPUT.PUT_LINE.
  • I ignore extra newlines in the input, but I don’t attempt to suppress output records whose fields are all null.

The code

This code should probably be written in C, which allows direct access to each character in a string. However, anyone who can install a C program on the Database server can probably copy the CLOB to a file and use External Table functionality directly! To make my PL/SQL as fast as possible, I use a technique from my COBOL and C days: the GOTO. With this type of code, calling subprocedures would measurably increase CPU usage.

create or replace function parse_csv_json_array(
  p_str in varchar2,
  p_log integer default null
) return varchar2 authid current_user is
/*
Objective: minimal service for parsing CSV with enclosures, similar to:
RECORDS DELIMITED BY NEWLINE
SKIP 0
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"' NOTRIM
(or FIELDS CSV WITHOUT EMBEDDED NOTRIM in 12.2)
> EXCEPT: I allow no whitespace between field enclosures and field terminators.

- record delimiter = NEWLINE, field terminator = "," and field enclosure = '"'
  all are hard coded for the moment.
- record delimiter cannot be embedded in enclosed field.
- without enclosures, everything between field terminators and / or record delimiters is output
- with enclosures:
  - no whitespace allowed before or after enclosed fields
  - field terminators can be embedded between the enclosures
  - to be enclosed, field enclosures must be doubled: they are undoubled in the output
  - no second enclosure > 'Ending field enclosure not found'
  - no terminator after enclosed field > error 'Ending field enclosure not followed by field terminator'
  - If there is a field delimiter at the end of a record, it ends the field but does not start a new field.
- Instead of LOG and BAD files, I log errors and bad input with DBMS_OUTPUT.PUT_LINE
- I ignore extra newlines in the input, but I do not try to suppress output with all null fields.
- The input contains multiple records, so create an array of arrays: one inner array per record
*/
  l_out varchar2(4000);
  i_str integer;
  max_i_str integer;
  max_i_rec integer;
  i_end integer;
  i_num_backslash integer;
begin
  if p_str = '[]' then
    return p_str;
  end if;
  i_str := 3;
  max_i_rec := 2 - 2; -- end of "preceding record", counting 2 for length of record delimiter
  max_i_str := length(p_str) - 2;
  l_out := '[["';

  <<start_record>>
  i_end := max_i_rec + 3; -- length of record delimiter + 1
  i_end := instr(p_str, '\n', i_end);
  if i_end = 0 or i_end is null then
    -- end of record and end of input
    max_i_rec := max_i_str;
  else
    max_i_rec := i_end - 1;
    -- found \n at beginning, skip
    if max_i_rec < i_str then
      i_str := i_str + 2; -- length of record delimiter
      goto start_record;
    end if;
    -- check for '\\' before n
    i_num_backslash := 0;
    while substr(p_str, i_end-i_num_backslash-1,1) = '\' loop
      i_num_backslash := i_num_backslash + 1;
    end loop;
    if mod(i_num_backslash,2) = 1 then
      -- false alert, there was '\n' in the input and json_array made it '\\n'
      goto start_record;
    end if;
  end if;

  <<start_field>>
  if substr(p_str, i_str, 2) = '\"' then
    -- enclosures, so must do one character at a time
    i_str := i_str + 2;
    goto during_enclosed;
  else
    -- no enclosures, search for end of field in record
    i_end := instr(substr(p_str, i_str, max_i_rec - i_str + 1)||',', ',');
    l_out := l_out || substr(p_str, i_str, i_end - 1);
    i_str := i_str + i_end;
    if i_str <= max_i_rec + 1 then
      -- end of field, not end of record
      l_out := l_out || '","';
      goto start_field;
    elsif max_i_rec < max_i_str then
      -- last field of record, not last record
      i_str := max_i_rec + 3;
      l_out := l_out || '"],["';
      goto start_record;
    else
      -- last field of record, last record of input
      l_out := l_out || '"]]';
      goto end_input;
    end if;
  end if;

  <<during_enclosed>>
  i_end := instr(p_str, '\"', i_str);
  if i_end = 0 or i_end > max_i_rec then
    dbms_output.put_line('Ending field enclosure not found, input follows:');
    dbms_output.put_line('<'||json_value(p_str, '$[0]')||'>');
    l_out := null;
    goto end_input;
  end if;
  l_out := l_out || substr(p_str, i_str, i_end - i_str);
  i_str := i_end + 2;
  if substr(p_str, i_str, 2) = '\"' then
    l_out := l_out || '\"';
    i_str := i_str + 2;
  elsif substr(p_str, i_str, 1) = ',' then
      l_out := l_out || '","';
      i_str := i_str + 1;
      goto start_field;
  elsif i_str > max_i_str then
    l_out := l_out || '"]]';
    goto end_input;
  elsif i_str > max_i_rec then
    l_out := l_out || '"],["';
    i_str := max_i_rec + 3;
    goto start_record;
  else
    dbms_output.put_line('Ending field enclosure not followed by field terminator, input follows:');
    dbms_output.put_line('<'||json_value(p_str, '$[0]')||'>');
    l_out := null;
    goto end_input;
  end if;
  goto during_enclosed;

  <<end_input>>
  return l_out;
end parse_csv_json_array;
/

Test results

I took five columns from DBA_OBJECTS and added one VARCHAR2(100) column and one NUMBER(4) column. I tested with four sizes of CLOB: 100,000 records, 200,000 records, 400,000 records and 800,000 records. I compared input with no enclosures to input with all fields enclosed.

Compared to the “simple” CSV solution, this solution is about 50% to 70% slower, depending on how many fields are enclosed. However, the number of records extracted per second remains stable as volume increases.

Please let me know if this solution is of practical interest to anyone…

Extract from CLOB with JSON objects

On AskTOM, Kim Berg Hansen recently used JSON_OBJECT to parse CLOB data in “name=value” format. I added a variant based on my work with CSV data.

Kim decided to use objects instead of arrays for an excellent and important reason:

  • a JSON object is an unordered collection of name/value pairs.
  • a JSON array is an ordered list of values.

CSV data is an ordered list of values, going from left to right, so a JSON array is the obvious choice. The AskTOM question concerned name/value pairs that were not necessarily in the same order and not necessarily all there! A JSON object was the natural choice.

I’m really liking JSON more and more: not only is it simple, but it seems pretty easy to figure out how best to use it for different problems.

Instead of copying everything here, I’m going to be lazy for once and invite you to read my contribution here:

More on JSON_TABLE

You will see some of the same ideas:

  • Use PIPE_CLOB to cut the CLOB into VARCHAR sized bites;
  • Use JSON_ARRAY on the whole bite to escape characters if needed;
  • Use REPLACE to form an overall array, but this time of objects;
  • Then use JSON_TABLE to generate one row per object and one column per value.
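The reshaping idea can be pictured outside the database. This is a hypothetical sketch in Python with an assumed “name=value” input format, not Kim's actual code: each record becomes a JSON object, tolerating pairs that are missing or out of order.

```python
import json

# Hypothetical sketch: one record per line, comma-separated "name=value" pairs,
# not necessarily in the same order and not necessarily all present.
def to_json_objects(text):
    records = []
    for line in text.splitlines():
        if line.strip():
            pairs = (p.split('=', 1) for p in line.split(','))
            records.append({name: value for name, value in pairs})
    return json.dumps(records)

print(to_json_objects('empno=7369,ename=SMITH\nename=ALLEN,empno=7499'))
# [{"empno": "7369", "ename": "SMITH"}, {"ename": "ALLEN", "empno": "7499"}]
```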

Hope this helps…

Extract CSV from CLOB with JSON arrays

Marc Bleron blogged about CSV CLOBs and JSON_TABLE two years ago. Here’s my contribution to improve on a great idea.

“Simple” CSV

The Oracle Database does a fine job of parsing CSV data in flat files, using the External Tables facility. Unfortunately, this service is not available for CSV data in VARCHARs or CLOBs. Marc showed that JSON_TABLE (and XMLTABLE) can parse “simple” CSV if it is reformatted. What is “simple”?

CSV data consists of records and fields within records.

  • Records are delimited by NEWLINE (or some other string).
  • Fields are terminated by commas (or some other string).
  • If necessary, some or all fields can be enclosed by double quotes " (or some other string).

When Marc says “simple”, he means that fields are never enclosed. This is important, because enclosed fields may contain the double quote (provided it appears twice in a row) and / or the comma. With “simple” CSV, we know that all commas are true field terminators and we don’t have to replace "" with ".

“Simple” also means that there is no trimming of whitespace: you get everything between the commas.

Finally, Marc assumes there is no terminator after the last field of the record, even though Oracle allows it.

So, “simple” CSV has delimited records with terminated fields that are never enclosed. There is no trimming of whitespace and the last field in the record is not terminated.
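As a sketch of what those rules amount to (in Python, purely to illustrate the rules, not the JSON_TABLE technique itself): split on record delimiters, then on field terminators, keeping whitespace as-is.

```python
# "Simple" CSV: records delimited by NEWLINE, fields terminated by commas,
# never enclosed, no whitespace trimming, no terminator after the last field.
def parse_simple_csv(text):
    return [line.split(',') for line in text.split('\n') if line != '']

# Note ' CLERK' keeps its leading space: no trimming.
print(parse_simple_csv('7369,SMITH, CLERK\n7499,ALLEN,300'))
# [['7369', 'SMITH', ' CLERK'], ['7499', 'ALLEN', '300']]
```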

My contribution

  • First of all, I break the CLOB into VARCHAR2 bites using the pipelined table function PIPE_CLOB (as explained in my previous post).
  • Then I remove any field terminator that immediately precedes a record delimiter.
  • Then I use JSON_ARRAY over the entire VARCHAR2 in case some characters need to be escaped.
  • Then I do several REPLACES such that:
    • each record becomes a JSON array of string values, and
    • those arrays are included in one overall array.
  • Finally, I use JSON_TABLE to break the overall array into rows and the inner arrays into columns.

Note that everything before the COLUMNS clause in JSON_TABLE is generic, because the inner arrays can contain any number of elements.

To demonstrate, here is a CLOB containing data from the EMP table, with a trailing comma added:

7369,SMITH,CLERK,7902,1980-12-17T00:00:00,800,,20,
7499,ALLEN,SALESMAN,7698,1981-02-20T00:00:00,1600,300,30,
7521,WARD,SALESMAN,7698,1981-02-22T00:00:00,1250,500,30,
7566,JONES,MANAGER,7839,1981-04-02T00:00:00,2975,,20,
7654,MARTIN,SALESMAN,7698,1981-09-28T00:00:00,1250,1400,30,
7698,BLAKE,MANAGER,7839,1981-05-01T00:00:00,2850,,30,
7782,CLARK,MANAGER,7839,1981-06-09T00:00:00,2450,,10,
7839,KING,PRESIDENT,,1981-11-17T00:00:00,5000,,10,
7844,TURNER,SALESMAN,7698,1981-09-08T00:00:00,1500,0,30,
7876,ADAMS,CLERK,7788,1987-05-23T00:00:00,1100,,20,
7900,JAMES,CLERK,7698,1981-12-03T00:00:00,950,,30,
7902,FORD,ANALYST,7566,1981-12-03T00:00:00,3000,,20,
7934,MILLER,CLERK,7782,1982-01-23T00:00:00,1300,,10,

And the code:

with last_term_removed as (
  select replace(column_value, ','||chr(10), chr(10)) str
  from table(pipe_clob((select c from t), 3500))
)
, json_data as (
  select '[' ||
    replace (
      replace(json_array(str), ',', '","'),
      '\n',
      '"],["'
    )
    || ']' jstr  
  from last_term_removed
)
select sql_data.*
from json_data j, json_table(
   j.jstr, '$[*]'
   columns empno    number        path '$[0]'
         , ename    varchar2(128) path '$[1]'
         , job      varchar2(128) path '$[2]'
         , mgr      number        path '$[3]'
         , hiredate date          path '$[4]'
         , sal      number        path '$[5]'
         , comm     number        path '$[6]'
         , deptno   number        path '$[7]'
) sql_data;
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
7369 SMITH CLERK 7902 1980-12-17 800 20
7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30
7521 WARD SALESMAN 7698 1981-02-22 1250 500 30
7566 JONES MANAGER 7839 1981-04-02 2975 20
7654 MARTIN SALESMAN 7698 1981-09-28 1250 1400 30
7698 BLAKE MANAGER 7839 1981-05-01 2850 30
7782 CLARK MANAGER 7839 1981-06-09 2450 10
7839 KING PRESIDENT 1981-11-17 5000 10
7844 TURNER SALESMAN 7698 1981-09-08 1500 0 30
7876 ADAMS CLERK 7788 1987-05-23 1100 20
7900 JAMES CLERK 7698 1981-12-03 950 30
7902 FORD ANALYST 7566 1981-12-03 3000 20
7934 MILLER CLERK 7782 1982-01-23 1300 10


Scalability: yes!

When Marc tested reading CLOBs directly, performance went bad as the CLOB increased in size:

Rows Seconds
91336 2.2
182672 4.3
365344 9.4
730688 22
1461376 840


In my tests with very similar data, the number of rows per second remains about the same:

LINES SECS LINES_PER_SEC AVG_LINES_PER_SEC PCT_DIFF
91336 1.374 66475 68517 -3
182672 2.6 70258 68517 2.5
365344 5.35 68289 68517 -.3
730688 10 73069 68517 6.6
1461376 22 66426 68517 -3.1


Read CLOBs fast with less memory


Reading a big CLOB is like trying to eat a burger all at once: it will take you forever if you don’t choke. Why not cut that CLOB into bite-size chunks? It’s faster, uses less memory – and it’s good table manners…

Marc Bleron has blogged about parsing CSV data in CLOBs. As the CLOB got bigger, the parsing time went way up. I decided to write a pipelined table function that would return the CLOB in VARCHAR2-sized bites, cutting at record boundaries.

  • By default, the maximum bite size is 4000 bytes. You can make it less if you need some room to use REPLACE.
  • Also by default, the record delimiter is the NEWLINE of your operating system, but you can change it.

One of the pains of CLOBs is that LENGTH() and SUBSTR() deal with characters only: the LENGTHB() and SUBSTRB() functions deal with bytes, but they are limited to VARCHAR2. Fortunately, VARCHAR2 in PL/SQL can be much longer, so I read 4000 characters into a buffer and then I cut off any records that overflow the 4000-byte boundary.
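The cutting idea can be sketched outside PL/SQL. This Python illustration uses my own names, not the PIPE_CLOB API: take up to max_bytes characters (which always covers at least max_bytes bytes), then cut back to the last record delimiter whose end still fits the byte budget.

```python
def clob_bites(text, max_bytes=4000, rec_delim='\n'):
    bites = []
    start = 0
    n = len(text)
    while start < n:
        # max_bytes characters always cover at least max_bytes bytes
        chunk = text[start:start + max_bytes]
        if start + len(chunk) < n or len(chunk.encode('utf-8')) > max_bytes:
            # not the final piece (or over budget): cut at a record boundary
            cut = chunk.rfind(rec_delim)
            while cut >= 0 and len(chunk[:cut + 1].encode('utf-8')) > max_bytes:
                cut = chunk.rfind(rec_delim, 0, cut)
            if cut < 0:
                raise ValueError('no record delimiter within the byte budget')
            chunk = chunk[:cut + 1]
        start += len(chunk)
        bites.append(chunk.rstrip(rec_delim))
    return bites

print(clob_bites('a,b\nc,d\ne,f\n', max_bytes=8))  # ['a,b\nc,d', 'e,f']
```

Each bite may contain several records; only the trailing delimiter is trimmed, so the bite can be handed to VARCHAR2-based functions intact.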

UPDATE 2018-06-15: I renamed “rec_term” to “rec_delim” because external table definitions say “records delimited by”. I now raise an exception if a bite does not contain the record delimiter.

UPDATE 2018-06-16 and -17: bug fixes for edge cases.

create or replace function pipe_clob (
  p_clob in clob,
  p_max_lengthb in integer default 4000,
  p_rec_delim in varchar2 default '
'
) return sys.odcivarchar2list pipelined authid current_user as
  l_amount integer;
  l_offset integer;
  l_buffer varchar2(4000 CHAR);
  l_out varchar2(4000 BYTE);
  l_buff_lengthb integer;
  l_occurence integer;
  l_rec_delim_length integer := length(p_rec_delim);
begin
  l_amount := p_max_lengthb;
  l_offset := 1;
  while l_amount = p_max_lengthb loop
    begin
      DBMS_LOB.READ (
       p_clob,
       l_amount,
       l_offset,
       l_buffer
      );
    exception when no_data_found then
      l_amount := 0;
    end;
    if l_amount = 0 then
      exit;
    elsif lengthb(l_buffer) < p_max_lengthb then
      pipe row(rtrim(l_buffer, p_rec_delim));
      exit;
    end if;
    l_buff_lengthb := p_max_lengthb + 1;
    l_occurence := 0;
    while l_buff_lengthb > p_max_lengthb loop
      l_occurence := l_occurence + 1;
      l_buff_lengthb := instrb(l_buffer,p_rec_delim,-1, l_occurence);
    end loop;
    if l_buff_lengthb = 0 then
      if l_amount = p_max_lengthb then
        raise_application_error(
          -20001, 
          'Input clob at offset '||l_offset||' for length '||l_amount||' has no record delimiter'
        );
      else
        l_buff_lengthb := lengthb(l_buffer)+l_rec_delim_length;
      end if;
    end if;
    l_out := substrb(l_buffer, 1, l_buff_lengthb-l_rec_delim_length);
    pipe row(l_out);
    l_offset := l_offset + nvl(length(l_out),0) + l_rec_delim_length;
    l_amount := p_max_lengthb;
  end loop;
  return;
end;
/

In my laptop tests, read time increased linearly with the number of records. If you try it, let me know how it works out.

Generate JOINs among related tables

After finding the shortest “foreign key” path between two tables, I’ll try to generate a SELECT statement that joins all the related tables.

(Please don’t tell me that Oracle SQL Developer does this: I know it does, as long as you already know the path and select all the tables yourself! I find a path for you and make sure the column names are unique. Despite these differences, you will probably never need this code if you use SQL Developer.)

  • The final SELECT starts with the bottom of the hierarchy and works its way up.
  • Table aliases are T1, T2 and so on.
  • All columns of all tables are selected (based on ALL_TAB_COLUMNS, not ALL_TAB_COLS).
  • To make columns unique, they are prefixed with the table alias: T1_* and so on.
  • Table names and column names are only enclosed in double quotes when they contain non-alphanumeric characters. Exceptionally, I always enclose the join columns in double quotes just to save myself some trouble…
  • I use ANSI join syntax, and when I do the equality comparison I use parentheses like this:
    on (T1.COL1, T1.COL2) = ((T2.COL1, T2.COL2))
    It’s a bit easier to code. Don’t be surprised to see the parentheses even when there is only one column.

The code is more than 100 lines long. I’ll put my inadequate explanation first, then the code.

  • search_for: the owners and tables I want to find the path between.
  • pur: the constraints I need to start with.
  • relations: add to each constraint row the remote or parent table.
  • best_path: returns a string that contains all the data from all the rows in the path
  • best_path_tables: breaks down the string into the rows and columns of the path I found. I use REGEXP functions instead of JSON functions so this should work from version 11.2.
  • final_select: get all the columns for the final SELECT, accessing ALL_TAB_COLUMNS
  • final_from: get the FROM and JOIN clauses, accessing ALL_CONS_COLUMNS.

I’ll sign off before posting the code. I would say “hope this helps”, but I feel more like saying

Hope this works!

with search_for (owner, table_name) as (
  select 'OBE', 'OEHR_REGIONS' from dual union all
  select 'OBE', 'OEHR_ORDER_ITEMS' from dual
)
, pur as (
  select OWNER,   CONSTRAINT_NAME, TABLE_NAME,
       R_OWNER, R_CONSTRAINT_NAME
  from dba_constraints
  where constraint_type in('P','U','R')
  and owner in (select username from dba_users where oracle_maintained = 'N')
)
, relations as (
  select a.owner, a.table_name, a.constraint_name,
       a.r_owner, b.table_name r_table_name, a.r_constraint_name
  from pur a join pur b
  on (a.R_OWNER, a.R_CONSTRAINT_NAME) = ((b.OWNER, b.CONSTRAINT_NAME))
  and (a.OWNER, a.table_name) != ((b.OWNER, b.table_name))
)
, best_path as (
  select path
  from (
    select row_number() over( order by
      case when (r_owner, r_table_name) 
            != ((connect_by_root(owner), connect_by_root(table_name)))
        then level end
      ) rn, 
      sys_connect_by_path(
      owner  ||'/'||table_name  ||'/'||  constraint_name||'/'||
      r_owner||'/'||r_table_name||'/'||r_constraint_name,'<') path
    from relations a
    where (r_owner, r_table_name) in (select * from search_for)
    start with (owner, table_name) in (select * from search_for)
    connect by nocycle (owner, table_name) = ((prior r_owner, prior r_table_name))
    and (owner, table_name) not in (select * from search_for)
  )
  where rn = 1
)
, best_path_tables as (
  select level lvl,
    'T'||level al,
    'T'||(level+1) r_al,
    regexp_substr(path, '<(.+?)/',            1, level,'i',1) owner,
    regexp_substr(path, '<(.+?/){1}(.+?)/',   1, level,'i',2) table_name,
    regexp_substr(path, '<(.+?/){2}(.+?)/',   1, level,'i',2) constraint_name,
    regexp_substr(path, '<(.+?/){3}(.+?)/',   1, level,'i',2) r_owner,
    regexp_substr(path, '<(.+?/){4}(.+?)/',   1, level,'i',2) r_table_name,
    regexp_substr(path, '<(.+?/){5}([^/<]+)', 1, level,'i',2) r_constraint_name
  from best_path
  connect by level <= regexp_count (path, '<')
)
, final_select as (
  select lvl, pref ||
  listagg(al||'.'||qu||column_name||qu||' '||qu||al||'_'||column_name||qu,',')
    within group(order by column_id) str
  from (
    select -99 lvl, 'select ' pref, al, column_name, column_id,
      case when not regexp_like (column_name, '^[A-Z][A-Z0-9$#_]*$') then '"' end qu
    from best_path_tables join all_tab_columns using(owner, table_name)
    where lvl = 1
    union all
    select lvl-99, ',', r_al, column_name, column_id,
      case when not regexp_like (column_name, '^[A-Z][A-Z0-9$#_]*$') then '"' end qu
    from best_path_tables b
    join all_tab_columns a on (b.r_owner, b.r_table_name) = ((a.owner, a.table_name))
  )
  group by lvl, pref
)
, final_from as (
  select 0 lvl, 'from '||oq||owner||oq||'.'||tq||table_name||tq||' '|| al str
  from (
    select a.*,
      case when not regexp_like (owner, '^[A-Z][A-Z0-9$#_]*$') then '"' end oq,
      case when not regexp_like (table_name, '^[A-Z][A-Z0-9$#_]*$') then '"' end tq
      from best_path_tables a
      where lvl = 1
  ) a
  union all
  select lvl, 'join '||oq||r_owner||oq||'.'||tq||r_table_name||tq||' '|| r_al || ' on (' ||
  (
    select listagg(al||'."'||column_name||'"',',') within group(order by position)
    from all_cons_columns 
    where (owner, table_name, constraint_name) = ((a.owner, a.table_name, a.constraint_name))
  )
  || ') = ((' ||
  (
    select listagg(r_al||'."'||column_name||'"',',') within group(order by position)
    from all_cons_columns 
    where (owner, table_name, constraint_name) = ((a.r_owner, a.r_table_name, a.r_constraint_name))
  )
  || '))'
  from (
    select a.*,
      case when not regexp_like (r_owner, '^[A-Z][A-Z0-9$#_]*$') then '"' end oq,
      case when not regexp_like (r_table_name, '^[A-Z][A-Z0-9$#_]*$') then '"' end tq
      from best_path_tables a
  ) a
)
select str from (
  select * from final_select
  union all
  select * from final_from
)
order by lvl;
select T1.ORDER_ID T1_ORDER_ID,T1.LINE_ITEM_ID T1_LINE_ITEM_ID,T1.PRODUCT_ID T1_PRODUCT_ID,T1.UNIT_PRICE T1_UNIT_PRICE,T1.QUANTITY T1_QUANTITY,T1.ORDER_ITEM_ID T1_ORDER_ITEM_ID
,T2.ORDER_ID T2_ORDER_ID,T2.ORDER_DATE T2_ORDER_DATE,T2.ORDER_MODE T2_ORDER_MODE,T2.CUSTOMER_ID T2_CUSTOMER_ID,T2.ORDER_STATUS T2_ORDER_STATUS,T2.ORDER_TOTAL T2_ORDER_TOTAL,T2.SALES_REP_ID T2_SALES_REP_ID,T2.PROMOTION_ID T2_PROMOTION_ID
,T3.EMPLOYEE_ID T3_EMPLOYEE_ID,T3.FIRST_NAME T3_FIRST_NAME,T3.LAST_NAME T3_LAST_NAME,T3.EMAIL T3_EMAIL,T3.PHONE_NUMBER T3_PHONE_NUMBER,T3.HIRE_DATE T3_HIRE_DATE,T3.JOB_ID T3_JOB_ID,T3.SALARY T3_SALARY,T3.COMMISSION_PCT T3_COMMISSION_PCT,T3.MANAGER_ID T3_MANAGER_ID,T3.DEPARTMENT_ID T3_DEPARTMENT_ID
,T4.DEPARTMENT_ID T4_DEPARTMENT_ID,T4.DEPARTMENT_NAME T4_DEPARTMENT_NAME,T4.MANAGER_ID T4_MANAGER_ID,T4.LOCATION_ID T4_LOCATION_ID
,T5.LOCATION_ID T5_LOCATION_ID,T5.STREET_ADDRESS T5_STREET_ADDRESS,T5.POSTAL_CODE T5_POSTAL_CODE,T5.CITY T5_CITY,T5.STATE_PROVINCE T5_STATE_PROVINCE,T5.COUNTRY_ID T5_COUNTRY_ID
,T6.COUNTRY_ID T6_COUNTRY_ID,T6.COUNTRY_NAME T6_COUNTRY_NAME,T6.REGION_ID T6_REGION_ID
,T7.REGION_ID T7_REGION_ID,T7.REGION_NAME T7_REGION_NAME
from OBE.OEHR_ORDER_ITEMS T1
join OBE.OEHR_ORDERS T2 on (T1."ORDER_ID") = ((T2."ORDER_ID"))
join OBE.OEHR_EMPLOYEES T3 on (T2."SALES_REP_ID") = ((T3."EMPLOYEE_ID"))
join OBE.OEHR_DEPARTMENTS T4 on (T3."DEPARTMENT_ID") = ((T4."DEPARTMENT_ID"))
join OBE.OEHR_LOCATIONS T5 on (T4."LOCATION_ID") = ((T5."LOCATION_ID"))
join OBE.OEHR_COUNTRIES T6 on (T5."COUNTRY_ID") = ((T6."COUNTRY_ID"))
join OBE.OEHR_REGIONS T7 on (T6."REGION_ID") = ((T7."REGION_ID"))

Split a string again and again

I recently needed to gather multiple columns from multiple rows into one string, then split them out again. JSON_TABLE did both splits at the same time!

The situation

I want to find the shortest “foreign key” path between two tables in the OBE schema: OEHR_ORDER_ITEMS and OEHR_DEPARTMENTS. I also want the intermediate tables and the constraints used to connect the tables. As a starting point for today, I’ll create a little table with the data for the tables in the OBE schema that start with ‘OEHR’:

drop table t purge;
create table t as
with search_for (owner, table_name) as (
  select 'OBE', 'OEHR_DEPARTMENTS' from dual union all
  select 'OBE', 'OEHR_ORDER_ITEMS' from dual
)
, pur as (
  select OWNER, CONSTRAINT_NAME, TABLE_NAME,
    R_OWNER, R_CONSTRAINT_NAME
  from dba_constraints
  where constraint_type in('P','U','R')
  and owner = 'OBE'
  and table_name like 'OEHR%'
)
, relations as (
  select a.table_name, a.constraint_name,
    b.table_name r_table_name, a.r_constraint_name
  from pur a join pur b
  on (a.R_OWNER, a.R_CONSTRAINT_NAME) = ((b.OWNER, b.CONSTRAINT_NAME))
  and (a.OWNER, a.table_name) != ((b.OWNER, b.table_name))
)
select * from relations;

Now I’ll use a hierarchical query to find paths from one table to the other:

select rownum rn, level lvl, TABLE_NAME, CONSTRAINT_NAME, R_TABLE_NAME, R_CONSTRAINT_NAME
from t a
start with table_name = 'OEHR_ORDER_ITEMS'
connect by nocycle table_name = prior r_table_name
and table_name != 'OEHR_DEPARTMENTS';
RN LVL TABLE_NAME CONSTRAINT_NAME R_TABLE_NAME R_CONSTRAINT_NAME
1 1 OEHR_ORDER_ITEMS OEHR_ORDER_ITEMS_ORDER_ID_FK OEHR_ORDERS OEHR_ORDER_PK
2 2 OEHR_ORDERS OEHR_ORDERS_CUSTOMER_ID_FK OEHR_CUSTOMERS OEHR_CUSTOMERS_PK
3 3 OEHR_CUSTOMERS OEHR_CUST_ACCT_MGR_FK OEHR_EMPLOYEES OEHR_EMP_EMP_ID_PK
4 4 OEHR_EMPLOYEES OEHR_EMP_DEPT_FK OEHR_DEPARTMENTS OEHR_DEPT_ID_PK
5 4 OEHR_EMPLOYEES OEHR_EMP_JOB_FK OEHR_JOBS OEHR_JOB_ID_PK
6 2 OEHR_ORDERS OEHR_ORDERS_SALES_REP_FK OEHR_EMPLOYEES OEHR_EMP_EMP_ID_PK
7 3 OEHR_EMPLOYEES OEHR_EMP_DEPT_FK OEHR_DEPARTMENTS OEHR_DEPT_ID_PK
8 3 OEHR_EMPLOYEES OEHR_EMP_JOB_FK OEHR_JOBS OEHR_JOB_ID_PK
9 1 OEHR_ORDER_ITEMS OEHR_ORDER_ITEMS_PRODUCT_ID_FK OEHR_PRODUCT_INFORMATION OEHR_PRODUCT_INFO_PK


Just looking at the data myself, I can see that the shortest path will include rows 1, 6 and 7. Rows 1 through 4 will work too, but the path is longer.

Getting the data for the whole path

The problem is, how can I identify those three rows I want and just query them? The only way I see is to use SYS_CONNECT_BY_PATH (or the equivalent with a CTE) to string together all the data from all the rows I want – which means I have to split that data out again later.

To simplify what follows, I’m going to manually delete the rows that don’t matter:

delete from t where table_name not in ('OEHR_ORDER_ITEMS','OEHR_ORDERS','OEHR_EMPLOYEES');
delete from t where r_table_name not in ('OEHR_DEPARTMENTS','OEHR_ORDERS','OEHR_EMPLOYEES');
commit;

The rows that are left contain the data I want:

TABLE_NAME CONSTRAINT_NAME R_TABLE_NAME R_CONSTRAINT_NAME
OEHR_ORDER_ITEMS OEHR_ORDER_ITEMS_ORDER_ID_FK OEHR_ORDERS OEHR_ORDER_PK
OEHR_ORDERS OEHR_ORDERS_SALES_REP_FK OEHR_EMPLOYEES OEHR_EMP_EMP_ID_PK
OEHR_EMPLOYEES OEHR_EMP_DEPT_FK OEHR_DEPARTMENTS OEHR_DEPT_ID_PK


Now I’ll string that data together:

select sys_connect_by_path(
  table_name  ||'/'||  constraint_name||'/'||
  r_table_name||'/'||r_constraint_name
  ,'<'
) path
from t a
where r_table_name = 'OEHR_DEPARTMENTS'
start with table_name = 'OEHR_ORDER_ITEMS'
connect by table_name = prior r_table_name;

<OEHR_ORDER_ITEMS/OEHR_ORDER_ITEMS_ORDER_ID_FK/OEHR_ORDERS/OEHR_ORDER_PK
<OEHR_ORDERS/OEHR_ORDERS_SALES_REP_FK/OEHR_EMPLOYEES/OEHR_EMP_EMP_ID_PK
<OEHR_EMPLOYEES/OEHR_EMP_DEPT_FK/OEHR_DEPARTMENTS/OEHR_DEPT_ID_PK

I split the data into three lines for legibility. Note that I have to use different literals in the two parameters of SYS_CONNECT_BY_PATH, otherwise I get an exception.

Splitting out the data

Now we’re finally getting to the point. I can use JSON_ARRAY to string together the data for one row. Once I have all the data for all the rows, I can turn that into an array of arrays. Finally, JSON_TABLE will split everything out for me. Let’s start with JSON_ARRAY:

select sys_connect_by_path(
  json_array(table_name, constraint_name, r_table_name, r_constraint_name)
  , '
'
) path
from t a
where r_table_name = 'OEHR_DEPARTMENTS'
start with table_name = 'OEHR_ORDER_ITEMS'
connect by table_name = prior r_table_name;

["OEHR_ORDER_ITEMS","OEHR_ORDER_ITEMS_ORDER_ID_FK","OEHR_ORDERS","OEHR_ORDER_PK"]
["OEHR_ORDERS","OEHR_ORDERS_SALES_REP_FK","OEHR_EMPLOYEES","OEHR_EMP_EMP_ID_PK"]
["OEHR_EMPLOYEES","OEHR_EMP_DEPT_FK","OEHR_DEPARTMENTS","OEHR_DEPT_ID_PK"]

Now I’ll turn that data into an array of arrays:

with data as (
  select sys_connect_by_path(
    json_array(table_name, constraint_name, r_table_name, r_constraint_name)
    , '/'
  ) path
  from t a
  where r_table_name = 'OEHR_DEPARTMENTS'
  start with table_name = 'OEHR_ORDER_ITEMS'
  connect by table_name = prior r_table_name
)
select '['||substr(replace(path,'/',','),2) || ']' path
from data;
[
  [
    "OEHR_ORDER_ITEMS",
    "OEHR_ORDER_ITEMS_ORDER_ID_FK",
    "OEHR_ORDERS",
    "OEHR_ORDER_PK"
  ],
  [
    "OEHR_ORDERS",
    "OEHR_ORDERS_SALES_REP_FK",
    "OEHR_EMPLOYEES",
    "OEHR_EMP_EMP_ID_PK"
  ],
  [
    "OEHR_EMPLOYEES",
    "OEHR_EMP_DEPT_FK",
    "OEHR_DEPARTMENTS",
    "OEHR_DEPT_ID_PK"
  ]
]

Now JSON_TABLE will break down the top array into rows, just by putting ‘$[*]’ in the second parameter. The COLUMNS will do the rest, splitting the data in the inner arrays into different columns.

with data as (
  select sys_connect_by_path(
    json_array(table_name, constraint_name, r_table_name, r_constraint_name)
    , '/'
  ) path
  from t a
  where r_table_name = 'OEHR_DEPARTMENTS'
  start with table_name = 'OEHR_ORDER_ITEMS'
  connect by table_name = prior r_table_name
)
select a.* from data, json_table(
  '['||substr(replace(path,'/',','),2) || ']',
  '$[*]' columns(
    rn for ordinality,
    table_name varchar2(128) path '$[0]',
    constraint_name varchar2(128) path '$[1]',
    r_table_name varchar2(128) path '$[2]',
    r_constraint_name varchar2(128) path '$[3]'
  )
) a;
RN TABLE_NAME CONSTRAINT_NAME R_TABLE_NAME R_CONSTRAINT_NAME
1 OEHR_ORDER_ITEMS OEHR_ORDER_ITEMS_ORDER_ID_FK OEHR_ORDERS OEHR_ORDER_PK
2 OEHR_ORDERS OEHR_ORDERS_SALES_REP_FK OEHR_EMPLOYEES OEHR_EMP_EMP_ID_PK
3 OEHR_EMPLOYEES OEHR_EMP_DEPT_FK OEHR_DEPARTMENTS OEHR_DEPT_ID_PK


Whew!

P.S. If you think a CTE solution is cleaner:

with data( TABLE_NAME, R_TABLE_NAME, path ) as (
  select n.TABLE_NAME, n.R_TABLE_NAME,
    json_array(n.TABLE_NAME, n.CONSTRAINT_NAME, n.R_TABLE_NAME, n.R_CONSTRAINT_NAME)
  from t n
  where table_name = 'OEHR_ORDER_ITEMS'
  union all
  select n.TABLE_NAME, n.R_TABLE_NAME,
    o.path || ',' ||
    json_array(n.TABLE_NAME, n.CONSTRAINT_NAME, n.R_TABLE_NAME, n.R_CONSTRAINT_NAME)
  from t n
  join data o on (n.table_name = o.r_table_name)
    and n.table_name != 'OEHR_DEPARTMENTS'
)
select b.* from data a, json_table(
  '['|| path || ']',
  '$[*]' columns(
    rn for ordinality,
    table_name varchar2(128) path '$[0]',
    constraint_name varchar2(128) path '$[1]',
    r_table_name varchar2(128) path '$[2]',
    r_constraint_name varchar2(128) path '$[3]'
  )
) b
where a.r_table_name = 'OEHR_DEPARTMENTS';

Foreign keys between two tables

In a comment on my post about getting all the foreign key relationships, I was asked how to get the shortest path between two tables. Not too easy…

UPDATE: you can see how hard this is by the fact that I have had to correct two bugs, one of them concerning direct or indirect self-references.

Using the ALL_CONSTRAINTS view, we can put together all the foreign key relationships between child tables and parent tables, and we can do hierarchical queries to go up and down branches of the hierarchy. If you give me two tables, though, how do I know whether to go up or down, and which table to start with?

The idea

I decided to start with both tables and go up. I expect children to outnumber parents, so going up should be more efficient. I start with both tables, knowing that at most one will be the descendant of the other. I then go up the hierarchy until I either get to the top or hit one of the tables. The “shortest path” is the one where the final LEVEL is smallest.

The code

I start with a little subquery that provides the target tables.

with search_for (owner, table_name) as (
  select 'OBE', 'OEHR_COUNTRIES' from dual union all
  select 'OBE', 'OEHR_ORDER_ITEMS' from dual
)

Then I get all the constraints that may be involved in foreign key relationships.

, pur as (
  select
    OWNER, CONSTRAINT_NAME, TABLE_NAME,
    R_OWNER, R_CONSTRAINT_NAME
  from dba_constraints
  where constraint_type in('P','U','R')
  and owner in (select username from dba_users where oracle_maintained = 'N')
)

I do a self join on the PUR subquery to add the parent table to each row.
UPDATE: added a line to RELATIONS to avoid self-references.

, relations as (
  select a.owner, a.table_name,
    a.r_owner, b.table_name r_table_name
  from pur a join pur b
  on (a.R_OWNER, a.R_CONSTRAINT_NAME) = ((b.OWNER, b.CONSTRAINT_NAME))
  and (a.OWNER, a.table_name) != ((b.OWNER, b.table_name))
)

Now I start going up the hierarchy:

  • I “start with” both tables
  • I “connect by” going up the hierarchy, stopping before the other table can be a child
    This is important, because the other table could be at the top of the hierarchy and not be a child at all.
  • The WHERE clause is applied after the CONNECT BY: I only keep rows where I found the other table.
  • I now have one row for each path from one table to the other. I use SYS_CONNECT_BY_PATH to concatenate the original child and all the parents going up the hierarchy.
    Notice I have to use one literal (<) within the first parameter and another (/) as the second parameter. I wanted to use ‘<’ in both places, but Oracle won’t let me.
  • Finally, I choose the shortest path, translate ‘/’ to ‘<’, get rid of the first ‘<’ and display the result.

UPDATES:

  • To avoid indirect self-referencing, I make sure the last remote table is not the same as the original table!
  • I had the ‘<’ outside of the CASE expression. Now it is inside, so you do not get ‘<<’ anymore.

, paths as (
  select row_number() over( order by
    case when (r_owner, r_table_name) != ((connect_by_root(owner), connect_by_root(table_name)))
      then level end
    ) rn, 
    sys_connect_by_path(
      case level when 1 then owner||'.'||table_name || '<' end ||
      r_owner||'.'||r_table_name,
    '/') path
  from relations r
  where (r_owner, r_table_name) in (select * from search_for)
  start with (owner, table_name) in (select * from search_for)
  connect by nocycle (owner, table_name) = ((prior r_owner, prior r_table_name))
  and (owner, table_name) not in (select * from search_for)
)
select substr(translate(path,'/','<'), 2) best_path
from paths where rn = 1;

BEST_PATH
------------------------------------------------------------------------------------------------------------------
OBE.OEHR_ORDER_ITEMS<OBE.OEHR_ORDERS<OBE.OEHR_EMPLOYEES<OBE.OEHR_DEPARTMENTS<OBE.OEHR_LOCATIONS<OBE.OEHR_COUNTRIES

As the man said, hope this helps…