Splitting Strings: a New Champion!

My last post on splitting strings conceded that “a combined SQL + PL/SQL solution beats the best pure SQL solution.” Wrong! I didn’t try JSON_TABLE.

Pretend it’s an array

It’s not hard to reformat a string like

'a,b,c,d,e,"'

to be a JSON array:

select replace(json_array('a,b,c,d,e,"'), ',', '","') jstr
from dual;

JSTR
--------------------------
["a","b","c","d","e","\""]

Notice that the double quote in the data itself is properly escaped by the call to JSON_ARRAY. Then I turn the string into a series of array elements by putting double quotes around each comma. Once all that is done, JSON_TABLE will split it like this:

select subs from json_table(
  replace(json_array('a,b,c,d,e,"'), ',', '","'),
  '$[*]' columns (
    subs varchar2(4000) path '$'
  )
);

SUBS
----
a
b
c
d
e
"

So the escaped data is automatically unescaped. Now for speed, here is my test harness:

set serveroutput on
declare
  l_num number;
  l_timestamp timestamp;
  l_plsql_secs number;
  l_JSON_secs number;
  l_num_substrings number := 10;
  procedure do(p_sql in varchar2) is
  begin
    dbms_output.put_line(p_sql);
    execute immediate p_sql;
  end do;
begin
  select count(*) into l_num from user_tables where table_name = 'T';
  if l_num > 0 then
    do('drop table t purge');
  end if;
  do('create table t(id number, str varchar2(4000)) cache');
  insert into t
  select level, to_char(level,'fm000000000')||',000000002,000000003,000000004,000000005,000000006,000000007,000000008,000000009,000000010'
  from dual
  connect by level <= 10000;
  commit;
  dbms_output.put_line('Substrings' || chr(9) || 'JSON' || chr(9) || 'PL/SQL');
  for i in 1..10 loop
    select count(*) into l_num from t;
     
    l_timestamp := localtimestamp;
    select count(column_value) into l_num from (
      select id, column_value from t, table(string_tokenize(str, ','))
    );
    l_plsql_secs := extract(second from localtimestamp - l_timestamp);
     
    l_timestamp := localtimestamp;
    select count(subs) into l_num from (
      select id, subs from t, JSON_table(
      '["' || replace(str, ',' , '","') || '"]', '$[*]' columns(
      subs varchar2(99) path '$'
      )
    )
    );
    l_JSON_secs := extract(second from localtimestamp - l_timestamp);
    dbms_output.put_line(l_num_substrings || chr(9) || l_JSON_secs || chr(9) || l_plsql_secs);
    update t set str =
    str || ',000000001,000000002,000000003,000000004,000000005,000000006,000000007,000000008,000000009,000000010';
    l_num_substrings := l_num_substrings + 10;
    commit;
  end loop;
end;
/

Notice that I keep the same number of input rows here and increase the number of substrings per row.

PL/SQL does great, but JSON_TABLE appears to beat it.

Advertisements

PIVOT Function with Totals

I have updated the ADVANCED_PIVOT function to support row and/or column totals. It’s neat, but you need to know what you’re doing.

No totals

Let’s start from an example that has no totals. If we didn’t want to pivot, we could do a simple GROUP BY:

SQL> select dname, job, sum(sal)
  2  from emp join dept using(deptno)
  3  group by dname, job
  4  order by 1,2;

DNAME          JOB         SUM(SAL)
-------------- --------- ----------
ACCOUNTING     CLERK           1300
ACCOUNTING     MANAGER         2450
ACCOUNTING     PRESIDENT       5000
RESEARCH       ANALYST         3000
RESEARCH       CLERK           1900
RESEARCH       MANAGER         2975
SALES          CLERK            950
SALES          MANAGER         2850
SALES          SALESMAN        5600

If we do want to pivot, the PIVOT clause will do the aggregation for us:

SQL> var rc refcursor

SQL> begin
  2  :rc := advanced_pivot(
  3  p_source => q'@(
  4  select dname, job, sal
  5  from emp join dept using(deptno)
  6  )@',
  7  p_aggfuncs => 'sum(sal)',
  8  p_pivotfor => 'job'
  9  );
 10  end;
 11  /

print :rc

DNAME             ANALYST      CLERK    MANAGER  PRESIDENT   SALESMAN
-------------- ---------- ---------- ---------- ---------- ----------
ACCOUNTING                      1300       2450       5000           
RESEARCH             3000       1900       2975                      
SALES                            950       2850                  5600

Horizontal total

Now suppose we want an extra column that shows the total for each row. Is that a “row total” or a “column total”? To avoid confusion, I’m going to say “horizontal total” when we get an extra column, and “vertical total” when we get an extra row.

The first thing I realized is that, if we want any totals, we have to compute them beforehand: the PIVOT clause will only be used to create the pivot columns and move the data to the right places. Here is a query to do that:

SQL> select dname, 
  2    job,
  3    sum(sal) sal
  4  from emp join dept using(deptno)
  5  group by dname, rollup(job);

DNAME          JOB              SAL
-------------- --------- ----------
SALES          CLERK            950
SALES          MANAGER         2850
SALES          SALESMAN        5600
SALES                          9400
RESEARCH       CLERK           1900
RESEARCH       ANALYST         3000
RESEARCH       MANAGER         2975
RESEARCH                       7875
ACCOUNTING     CLERK           1300
ACCOUNTING     MANAGER         2450
ACCOUNTING     PRESIDENT       5000
ACCOUNTING                     8750

Notice the use of ROLLUP(JOB) to get totals for all the jobs, but always at the DNAME level.

Notice also that the total rows have a JOB value of NULL. PIVOT will not allow us to pivot data based on a NULL column value. So we use the GROUPING() function to find out when there is a total of all jobs, and replace the NULL with a literal.

SQL> select dname, 
  2    decode(grouping(job),1,'(Total)',job) job,
  3    sum(sal) sal
  4  from emp join dept using(deptno)
  5  group by dname, rollup(job);

DNAME          JOB              SAL
-------------- --------- ----------
SALES          CLERK            950
SALES          MANAGER         2850
SALES          SALESMAN        5600
SALES          (Total)         9400
RESEARCH       CLERK           1900
RESEARCH       ANALYST         3000
RESEARCH       MANAGER         2975
RESEARCH       (Total)         7875
ACCOUNTING     CLERK           1300
ACCOUNTING     MANAGER         2450
ACCOUNTING     PRESIDENT       5000
ACCOUNTING     (Total)         8750

Notice I always put the literal ‘(Total)’ before the value. If the value is not a string, Oracle will implicitly convert it. If we said decode(grouping(<column>),0,<column>,'(Total)') and <column> was a date or a number, we would get a conversion exception.

Now that we have the totals with a valid label, let’s plug that into my function:

SQL> begin
  2  :rc := advanced_pivot(
  3  p_source => q'@(
  4  select dname, 
  5    decode(grouping(job),1,'(Total)',job) job,
  6    sum(sal) sal
  7  from emp join dept using(deptno)
  8  group by dname, rollup(job)
  9  )@',
 10  p_aggfuncs => 'sum(sal)',
 11  p_pivotfor => 'job'
 12  );
 13  end;
 14  /

DNAME             (Total)    ANALYST      CLERK    MANAGER  PRESIDENT   SALESMAN
-------------- ---------- ---------- ---------- ---------- ---------- ----------
ACCOUNTING           8750                  1300       2450       5000           
RESEARCH             7875       3000       1900       2975                      
SALES                9400                   950       2850                  5600

Oops! The Total column is in the wrong place. To fix that, I added an optional parameter p_total_label. Whenever we order by a column, I will make sure that the content of that parameter goes last. Let’s try it:

SQL> begin
  2  :rc := advanced_pivot(
  3  p_source => q'@(
  4  select dname, 
  5    decode(grouping(job),1,'(Total)',job) job,
  6    sum(sal) sal
  7  from emp join dept using(deptno)
  8  group by dname, rollup(job)
  9  )@',
 10  p_aggfuncs => 'sum(sal)',
 11  p_pivotfor => 'job',
 12  p_total_label => '(Total)'
 13  );
 14  end;
 15  /

DNAME             ANALYST      CLERK    MANAGER  PRESIDENT   SALESMAN    (Total)
-------------- ---------- ---------- ---------- ---------- ---------- ----------
ACCOUNTING                      1300       2450       5000                  8750
RESEARCH             3000       1900       2975                             7875
SALES                            950       2850                  5600       9400

Warnings:

  1. Make sure you use the same literal value for p_total_label and within the p_source query! The function will never try to check for coherence, so it’s up to you.
  2. To use something like ‘(Total)’ instead of NULL, dates and numbers must be converted to strings. If that happens, the function can only order by the string value, not the original date or number value!

Vertical total

Let’s try to ROLLUP by DNAME rather than JOB, in order to get an extra row with the totals for each job.

SQL> begin
  2  :rc := advanced_pivot(
  3  p_source => q'@(
  4  select decode(grouping(dname),1,'(Total)',dname) dname, 
  5    job,
  6    sum(sal) sal
  7  from emp join dept using(deptno)
  8  group by rollup(dname), job
  9  )@',
 10  p_aggfuncs => 'sum(sal)',
 11  p_pivotfor => 'job',
 12  p_total_label => '(Total)'
 13  );
 14  end;
 15  /

DNAME             ANALYST      CLERK    MANAGER  PRESIDENT   SALESMAN
-------------- ---------- ---------- ---------- ---------- ----------
(Total)              3000       4150       8275       5000       5600
ACCOUNTING                      1300       2450       5000           
RESEARCH             3000       1900       2975                      
SALES                            950       2850                  5600

We still have a problem with sorting the rows. The function has no idea that the DNAME column has a ‘(Total)’ row that needs to go last. To fix that, I added another optional parameter p_ordercol. The function will order the final query by the column listed here, making sure that the total line goes last.

SQL> begin
  2  :rc := advanced_pivot(
  3  p_source => q'@(
  4  select decode(grouping(dname),1,'(Total)',dname) dname, 
  5    job,
  6    sum(sal) sal
  7  from emp join dept using(deptno)
  8  group by rollup(dname), job
  9  )@',
 10  p_aggfuncs => 'sum(sal)',
 11  p_pivotfor => 'job',
 12  p_total_label => '(Total)',
 13  p_ordercol => 'dname'
 14  );
 15  end;
 16  /

DNAME             ANALYST      CLERK    MANAGER  PRESIDENT   SALESMAN
-------------- ---------- ---------- ---------- ---------- ----------
ACCOUNTING                      1300       2450       5000           
RESEARCH             3000       1900       2975                      
SALES                            950       2850                  5600
(Total)              3000       4150       8275       5000       5600

Horizontal and vertical totals

To get both totals, we just need to change the source query. We could combine two ROLLUPs, but in this case the CUBE function is more concise:

SQL> begin
  2  :rc := advanced_pivot(
  3  p_source => q'@(
  4  select decode(grouping(dname),1,'(Total)',dname) dname, 
  5    decode(grouping(job),1,'(Total)',job) job,
  6    sum(sal) sal
  7  from emp join dept using(deptno)
  8  group by cube(dname,job)
  9  )@',
 10  p_aggfuncs => 'sum(sal)',
 11  p_pivotfor => 'job',
 12  p_total_label => '(Total)',
 13  p_ordercol => 'dname'
 14  );
 15  end;
 16  /

DNAME             ANALYST      CLERK    MANAGER  PRESIDENT   SALESMAN    (Total)
-------------- ---------- ---------- ---------- ---------- ---------- ----------
ACCOUNTING                      1300       2450       5000                  8750
RESEARCH             3000       1900       2975                             7875
SALES                            950       2850                  5600       9400
(Total)              3000       4150       8275       5000       5600      26025

Two vertical totals

Now we’re getting fancy! Suppose we want totals by department, but also subtotals within each department by the year someone was hired?

  • row 10: get the hiring year
  • row 13: instead of ROLLUP on DNAME only, we ROLLUP on DNAME and HIREYEAR
  • row 5: HIREYEAR should say ‘(Total)’ only when we are on a subtotal line, not on the grand total.
  • row 18: the p_ordercol parameter can be a list of column names
SQL> begin
  2  :rc := advanced_pivot(
  3  p_source => q'@(
  4  select decode(grouping(dname),1,'(Total)',dname) dname,
  5    decode(grouping_id(dname,hireyear),1,'(Total)',hireyear) hireyear,
  6    job,
  7    sum(sal) sal
  8  from (
  9    select dname, job, sal,
 10      extract(year from hiredate) hireyear
 11    from dept join emp using(deptno)
 12  )
 13  group by rollup(dname, hireyear), job
 14  )@',
 15  p_aggfuncs => 'sum(sal)',
 16  p_pivotfor => 'job',
 17  p_total_label => '(Total)',
 18  p_ordercol => 'dname,hireyear'
 19  );
 20  end;
 21  /

DNAME          HIREYEAR     ANALYST      CLERK    MANAGER  PRESIDENT   SALESMAN
-------------- --------- ---------- ---------- ---------- ---------- ----------
ACCOUNTING     1981                                  2450       5000           
ACCOUNTING     1982                       1300                                 
ACCOUNTING     (Total)                    1300       2450       5000           
RESEARCH       1980                        800                                 
RESEARCH       1981            3000                  2975                      
RESEARCH       1987                       1100                                 
RESEARCH       (Total)         3000       1900       2975                      
SALES          1981                        950       2850                  5600
SALES          (Total)                     950       2850                  5600
(Total)                        3000       4150       8275       5000       5600

Two vertical totals, one horizontal

  • row 13: we ROLLUP on the vertical axis (DNAME, HIREYEAR) and on the horizontal (JOB)
  • row 6: JOB must never have a NULL value, else the PIVOT will fail.
SQL> begin
  2  :rc := advanced_pivot(
  3  p_source => q'@(
  4  select decode(grouping(dname),1,'(Total)',dname) dname,
  5    decode(grouping_id(dname,hireyear),1,'(Total)',hireyear) hireyear,
  6    decode(grouping(job),1,'(Total)',job) job,
  7    sum(sal) sal
  8  from (
  9    select dname, job, sal,
 10      extract(year from hiredate) hireyear
 11    from dept join emp using(deptno)
 12  )
 13  group by rollup(dname, hireyear), rollup(job)
 14  )@',
 15  p_aggfuncs => 'sum(sal)',
 16  p_pivotfor => 'job',
 17  p_total_label => '(Total)',
 18  p_ordercol => 'dname,hireyear'
 19  );
 20  end;
 21  /

DNAME          HIREYEAR     ANALYST      CLERK    MANAGER  PRESIDENT   SALESMAN    (Total)
-------------- --------- ---------- ---------- ---------- ---------- ---------- ----------
ACCOUNTING     1981                                  2450       5000                  7450
ACCOUNTING     1982                       1300                                        1300
ACCOUNTING     (Total)                    1300       2450       5000                  8750
RESEARCH       1980                        800                                         800
RESEARCH       1981            3000                  2975                             5975
RESEARCH       1987                       1100                                        1100
RESEARCH       (Total)         3000       1900       2975                             7875
SALES          1981                        950       2850                  5600       9400
SALES          (Total)                     950       2850                  5600       9400
(Total)                        3000       4150       8275       5000       5600      26025

One vertical, two horizontal

  • row 5: HIREYEAR is now a source for a pivot column, so it must never be NULL.
  • row 16: the p_pivotfor parameter can be a list of column names
SQL> begin
  2  :rc := advanced_pivot(
  3  p_source => q'@(
  4  select decode(grouping(dname),1,'(Total)',dname) dname,
  5    decode(grouping(hireyear),1,'(Total)',hireyear) hireyear,
  6    decode(grouping(job),1,'(Total)',job) job,
  7    sum(sal) sal
  8  from (
  9    select dname, job, sal,
 10      extract(year from hiredate) hireyear
 11    from dept join emp using(deptno)
 12  )
 13  group by rollup(dname), rollup(hireyear, job)
 14  )@',
 15  p_aggfuncs => 'sum(sal)',
 16  p_pivotfor => 'hireyear,job',
 17  p_total_label => '(Total)',
 18  p_ordercol => 'dname'
 19  );
 20  end;
 21  /

DNAME          1980_CLERK 1980_(Total) 1981_ANALYST 1981_CLERK 1981_MANAGER 1981_PRESIDENT 1981_SALESMAN 1981_(Total) 1982_CLERK 1982_(Total) 1987_CLERK 1987_(Total) (Total)_(Total)
-------------- ---------- ------------ ------------ ---------- ------------ -------------- ------------- ------------ ---------- ------------ ---------- ------------ ---------------
ACCOUNTING                                                             2450           5000                       7450       1300         1300                                    8750
RESEARCH              800          800         3000                    2975                                      5975                               1100         1100            7875
SALES                                                      950         2850                         5600         9400                                                            9400
(Total)               800          800         3000        950         8275           5000          5600        22825       1300         1300       1100         1100           26025

Conclusion

Did you really read down to here? If so, congratulations! Here is what I wish you would take away from this post:

  • The ROLLUP and CUBE extensions to GROUP BY are very powerful means to get various levels of subtotals, totals and grand totals.
  • The PIVOT clause can manage vertical and / or horizontal totals very cleanly.
  • The source query should do one ROLLUP on the vertical columns and another on the horizontal columns, except when there is exactly one of each: then just use CUBE.
  • The ADVANCED_PIVOT function generates the appropriate PIVOT clause based on the data, and can handle multiple levels of vertical and / or horizontal totals.
      • The horizontal “p_pivotfor” columns must never be NULL.
      • The vertical “p_ordercol” columns can be NULL or labeled, your choice. If there are dates or numbers, NULL allows for sorting by the original values.

Improved PIVOT Function

My generic PIVOT function only allowed one column in the FOR clause. The modestly named “Advanced” function has no such limit.

Here are some test runs to put it through its paces. I’ll save the code for the end.

UPDATE 2018-05-31: I added support for horizontal and vertical totals. The most recent code is still at the bottom of this post, but the additional explanations are here: PIVOT Function with Totals

NOTE: I changed the order of the input parameters to match the order of the PIVOT statement itself: aggregation functions before the pivot_for_clause.

One aggregation function, one FOR column

SQL> set serveroutput off
SQL> var rc refcursor
SQL> begin
  2    :rc := advanced_pivot(
  3      '(select dname, job, sal from DEPT join EMP using(DEPTNO))', 
  4      'AVG(SAL)', 
  5      'JOB'
  6    );
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL> print :rc

DNAME             ANALYST      CLERK    MANAGER  PRESIDENT   SALESMAN
-------------- ---------- ---------- ---------- ---------- ----------
ACCOUNTING                      1300       2450       5000           
RESEARCH             3000        950       2975                      
SALES                            950       2850                  1400

Transposing the data and aliasing the aggregate

SQL> begin
  2    :rc := advanced_pivot(
  3      '(select dname, job, sal from DEPT join EMP using(DEPTNO))', 
  4      'AVG(SAL) AVG_SAL', 
  5      'DNAME'
  6    );
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL> print :rc


JOB       ACCOUNTING_AVG_SAL RESEARCH_AVG_SAL SALES_AVG_SAL
--------- ------------------ ---------------- -------------
ANALYST                                  3000              
CLERK                   1300              950           950
MANAGER                 2450             2975          2850
PRESIDENT               5000                               
SALESMAN                                               1400

One aggregate, two FOR columns

The pivot columns are ordered by DNAME, JOB from left to right.

SQL> begin
  2    :rc := advanced_pivot(
  3      '(select dname, job, sal from DEPT join EMP using(DEPTNO))', 
  4      'MAX(SAL) MAX_SAL', 
  5      'DNAME,JOB'
  6    );
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL> print :rc


ACCOUNTING_CLERK_MAX_SAL ACCOUNTING_MANAGER_MAX_SAL ACCOUNTING_PRESIDENT_MAX_SAL RESEARCH_ANALYST_MAX_SAL RESEARCH_CLERK_MAX_SAL RESEARCH_MANAGER_MAX_SAL SALES_CLERK_MAX_SAL SALES_MANAGER_MAX_SAL SALES_SALESMAN_MAX_SAL
------------------------ -------------------------- ---------------------------- ------------------------ ---------------------- ------------------------ ------------------- --------------------- ----------------------
                    1300                       2450                         5000                     3000                   1100                     2975                 950                  2850                   1600

Shorter pivot column names

The pivot column names can get long real fast. With this generic function, the way to shorten the names is to shorten the input data.

SQL> begin
  2    :rc := advanced_pivot(
  3      '(select initcap(substr(dname,1,4)) dname, initcap(substr(job,1,5)) job, sal from DEPT join EMP using(DEPTNO))', 
  4      'MIN(SAL) "Min_Sal"', 
  5      'JOB,DNAME'
  6    );
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL> print :rc


Analy_Rese_Min_Sal Clerk_Acco_Min_Sal Clerk_Rese_Min_Sal Clerk_Sale_Min_Sal Manag_Acco_Min_Sal Manag_Rese_Min_Sal Manag_Sale_Min_Sal Presi_Acco_Min_Sal Sales_Sale_Min_Sal
------------------ ------------------ ------------------ ------------------ ------------------ ------------------ ------------------ ------------------ ------------------
              3000               1300                800                950               2450               2975               2850               5000               1250

Two aggregates, two FOR columns

I would rather not do this by hand…

SQL> begin
  2    :rc := advanced_pivot(
  3      '(select substr(dname,1,2) dname, substr(job,1,2) job, sal from DEPT join EMP using(DEPTNO))', 
  4      'count(*),AVG(SAL) AVG', 
  5      'DNAME,JOB'
  6    );
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL> print :rc


     AC_CL  AC_CL_AVG      AC_MA  AC_MA_AVG      AC_PR  AC_PR_AVG      RE_AN  RE_AN_AVG      RE_CL  RE_CL_AVG      RE_MA  RE_MA_AVG      SA_CL  SA_CL_AVG      SA_MA  SA_MA_AVG      SA_SA  SA_SA_AVG
---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
         1       1300          1       2450          1       5000          1       3000          2        950          1       2975          1        950          1       2850          4       1400

What if the FOR column is a date?

I want to break down the number of hires in each department by month, so why not truncate HIREDATE by month, then just display the month and the year?

SQL> begin
  2    :rc := advanced_pivot(
  3      '(select dname, to_char(trunc(hiredate,''MM''),''MON-yyyy'') hiredate from DEPT join EMP using(DEPTNO))', 
  4      'count(*)', 
  5      'hiredate'
  6    );
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL> print :rc


DNAME            APR-1981   DEC-1980   DEC-1981   FEB-1981   JAN-1982   JUN-1981   MAY-1981   MAY-1987   NOV-1981   SEP-1981
-------------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
ACCOUNTING              0          0          0          0          1          1          0          0          1          0
RESEARCH                1          1          1          0          0          0          0          1          0          0
SALES                   0          0          1          2          0          0          1          0          0          2

Oops! The input HIREDATE is now a string, so the pivot column names are ordered by the string values.

Ordering dates in date order

Here is a situation where implicit conversion is your friend. Just keep the truncated date in your input query and set NLS_DATE_FORMAT as you wish. I’ll order the columns by date and Oracle will convert the dates to strings in the format you specified.

SQL> alter session set nls_date_format='mon-yyyy';

Session altered.

SQL> begin
  2    :rc := advanced_pivot(
  3      '(select dname, trunc(hiredate,''MM'') hiredate from DEPT join EMP using(DEPTNO))', 
  4      'count(*)', 
  5      'hiredate'
  6    );
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL> print :rc


DNAME            dec-1980   feb-1981   apr-1981   may-1981   jun-1981   sep-1981   nov-1981   dec-1981   jan-1982   may-1987
-------------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
ACCOUNTING              0          0          0          0          1          0          1          0          1          0
RESEARCH                1          0          1          0          0          0          0          1          0          1
SALES                   0          2          0          1          0          2          0          1          0          0

The Code

create or replace function advanced_pivot(
  p_source in varchar2,       -- table, view or query in parentheses
  p_aggfuncs in varchar2,     -- one or more aggregation functions with or without aliases
  p_pivotfor in varchar2,     -- one or more columns from the input separated by commas
  p_total_label in varchar2 default null, -- label to use for pivot totals (if CUBE, ROLLUP, etc.)
  p_ordercol in varchar2 default null     -- column(s) to order by with p_total_label last
) return sys_refcursor authid current_user is
/*
Calculates pivot_in_list using SQL 1, updates SQL 2 text and opens ref cursor.
- Pivot_in_list concatenates values from all PIVOT_FOR columns
  for example, if (DNAME,JOB) then
    ('ACCOUNTING','CLERK') as "ACCOUNTING_CLERK",
    ('ACCOUNTING','MANAGER') as "ACCOUNTING_MANAGER",
  and so on
*/
  l_sql sys.odcivarchar2list := sys.odcivarchar2list(
  
q'@select listagg('(''' || expr || ''') as "' || al || '"', ',
')
within group(order by #ORDERFOR#)
from (select distinct
#EXPR# expr,
#ALIAS# al,
#PIVOTFOR#
from #SOURCE#)@',

'select * from #SOURCE#
pivot(#AGGFUNCS# for (#PIVOTFOR#) in (
#PIVOTINLIST#
)) order by #ORDERCOL#1,2,3'
  );
  l_pivotfor varchar2(255);
  l_orderfor varchar2(255);
  l_refcur sys_refcursor;
  l_pivotinlist varchar2(32767);
  l_expr varchar2(4000);
  l_alias varchar2(4000);
  l_ordercol varchar2(255);
begin
  l_pivotfor := trim( ',' from replace(p_pivotfor,' ') );
  l_orderfor := 
    case when p_total_label is null 
      then l_pivotfor
      else 'nullif(' 
        || replace(l_pivotfor, ',', ','''||p_total_label||'''),nullif(') 
        || ','''||p_total_label||''')'
    end;
  l_sql(1) := replace(l_sql(1), '#ORDERFOR#', l_orderfor);
  
  l_expr := replace(l_pivotfor,',',q'%||''','''||%');
  l_sql(1) := replace(l_sql(1), '#EXPR#', l_expr);
  
  l_alias := replace(l_pivotfor,',',q'%||'_'||%');
  l_sql(1) := replace(l_sql(1), '#ALIAS#', l_alias);
  
  for i in 1..l_sql.count loop
    l_sql(i) := replace(l_sql(i), '#SOURCE#', p_source);
    l_sql(i) := replace(l_sql(i), '#PIVOTFOR#', l_pivotfor);
  end loop;
  
  dbms_output.put_line(l_sql(1));
  dbms_output.put_line('/');
  open l_refcur for l_sql(1);
  fetch l_refcur into l_pivotinlist;
  close l_refcur;
  
  l_sql(2) := replace(l_sql(2), '#AGGFUNCS#', p_aggfuncs);
  l_sql(2) := replace(l_sql(2), '#PIVOTINLIST#', l_pivotinlist);
  
  
  l_ordercol := trim( ',' from replace(p_ordercol,' ') );
  l_ordercol :=
    case when p_total_label is null or l_ordercol is null
      then l_ordercol
      else 'nullif(' 
        || replace(l_ordercol, ',', ','''||p_total_label||'''),nullif(') 
        || ','''||p_total_label||'''),'
    end;
  l_sql(2) := replace(l_sql(2), '#ORDERCOL#', l_ordercol);
  
  dbms_output.put_line(l_sql(2));
  dbms_output.put_line('/');
  open l_refcur for l_sql(2);
  return l_refcur;
  
end advanced_pivot;
/

I really enjoyed the challenge of writing a truly generic function. I hope you enjoy using it!

Get all the Foreign Key Relationships

An old asktom question was recently highlighted on Twitter: All Parent – Child tables in the database. I contributed a new solution, but here is one I think is more complete.

The Requirement: foreign key relationships

The questioner wanted to show the hierarchy of “parent-child” relationships, starting from the top ancestor and going down to the last descendant. He asked for the entire database.

By “parent-child” we mean foreign key relationships: the “parent” has a primary key constraint or a unique constraint, and the “child” has a foreign key constraint that references the parent key.

I assume I should omit any table that is neither parent nor child. I also add an IS_CYCLE column: if = 1, a CONNECT_BY_ISCYCLE condition was met. Finally, I show whether the referenced key is Primary or Unique.

Building the solution

Since we are working on the whole database, DBA_CONSTRAINTS is the appropriate source. For just my schema, I can use USER_CONSTRAINTS instead. For readability, I’ll restrict the output to the HR schema.

The basic information we need includes the owner, constraint name, constraint type and table name, together with the referenced constraint and its owner. I include only constraints of type ‘P’ (Primary), ‘U’ (Unique), and ‘R’ (foreign key Reference). In recent versions, I can restrict my query to owners that are not “Oracle maintained”.

with pur as (
  select OWNER, CONSTRAINT_NAME, CONSTRAINT_TYPE, TABLE_NAME,
    R_OWNER, R_CONSTRAINT_NAME
  from dba_constraints
  where constraint_type in('P','U','R')
  and owner in (select username from dba_users where oracle_maintained = 'N')
)
select * from pur;
OWNER CONSTRAINT_NAME CONSTRAINT_TYPE TABLE_NAME R_OWNER R_CONSTRAINT_NAME
HR COUNTRY_C_ID_PK P COUNTRIES
HR COUNTR_REG_FK R COUNTRIES HR REG_ID_PK
HR DEPT_ID_PK P DEPARTMENTS
HR DEPT_LOC_FK R DEPARTMENTS HR LOC_ID_PK
HR DEPT_MGR_FK R DEPARTMENTS HR EMP_EMP_ID_PK
HR EMP_EMP_ID_PK P EMPLOYEES
HR EMP_DEPT_FK R EMPLOYEES HR DEPT_ID_PK
HR EMP_JOB_FK R EMPLOYEES HR JOB_ID_PK
HR EMP_MANAGER_FK R EMPLOYEES HR EMP_EMP_ID_PK
HR EMP_EMAIL_UK U EMPLOYEES
HR JOB_ID_PK P JOBS
HR JHIST_EMP_ID_ST_DATE_PK P JOB_HISTORY
HR JHIST_DEPT_FK R JOB_HISTORY HR DEPT_ID_PK
HR JHIST_EMP_FK R JOB_HISTORY HR EMP_EMP_ID_PK
HR JHIST_JOB_FK R JOB_HISTORY HR JOB_ID_PK
HR LOC_ID_PK P LOCATIONS
HR LOC_C_ID_FK R LOCATIONS HR COUNTRY_C_ID_PK
HR REG_ID_PK P REGIONS

 

Now we need to add the table the referenced constraints belong to. I took the easy path and did a self join.

with pur as (
  select OWNER, CONSTRAINT_NAME, CONSTRAINT_TYPE, TABLE_NAME,
    R_OWNER, R_CONSTRAINT_NAME
  from dba_constraints
  where constraint_type in('P','U','R')
  and owner in (select username from dba_users where oracle_maintained = 'N')
)
, relations as (
  select a.owner, a.table_name, a.r_owner,
    b.table_name r_table_name,
    b.constraint_type r_constraint_type
  from pur a join pur b
  on (a.R_OWNER, a.R_CONSTRAINT_NAME) = ((b.OWNER, b.CONSTRAINT_NAME))
  --and (a.OWNER, a.table_name) != ((b.OWNER, b.table_name))
)
select * from relations
OWNER TABLE_NAME R_OWNER R_TABLE_NAME R_CONSTRAINT_TYPE
HR COUNTRIES HR REGIONS P
HR DEPARTMENTS HR EMPLOYEES P
HR DEPARTMENTS HR LOCATIONS P
HR EMPLOYEES HR DEPARTMENTS P
HR EMPLOYEES HR EMPLOYEES P
HR EMPLOYEES HR JOBS P
HR JOB_HISTORY HR DEPARTMENTS P
HR JOB_HISTORY HR EMPLOYEES P
HR JOB_HISTORY HR JOBS P
HR LOCATIONS HR COUNTRIES P

 

Note that EMPLOYEES references itself.

UPDATE: I added a commented line at the end of RELATIONS. If you want to eliminate self-references, uncomment that line.

Now we need to identify the starting points for the hierarchical query. These are tables that are referenced, but do not reference any other table – though they can reference themselves. We’ll add those to the RELATIONS result set:

with pur as (
  select OWNER, CONSTRAINT_NAME, CONSTRAINT_TYPE, TABLE_NAME,
    R_OWNER, R_CONSTRAINT_NAME
  from dba_constraints
  where constraint_type in('P','U','R')
  and owner in (select username from dba_users where oracle_maintained = 'N')
)
, relations as (
  select a.owner, a.table_name, a.r_owner,
    b.table_name r_table_name,
    b.constraint_type r_constraint_type
  from pur a join pur b
  on (a.R_OWNER, a.R_CONSTRAINT_NAME) = ((b.OWNER, b.CONSTRAINT_NAME))
  -- and (a.OWNER, a.table_name) != ((b.OWNER, b.table_name))
)
, with_parents as (
  select * from relations
  union
  select r_owner, r_table_name, null, null, null
  from relations 
  where (r_owner, r_table_name) not in (
    select owner, table_name
    from relations
    where (owner, table_name) != ((r_owner, r_table_name))
  )
) select * from with_parents
OWNER TABLE_NAME R_OWNER R_TABLE_NAME R_CONSTRAINT_TYPE
HR COUNTRIES HR REGIONS P
HR DEPARTMENTS HR EMPLOYEES P
HR DEPARTMENTS HR LOCATIONS P
HR EMPLOYEES HR DEPARTMENTS P
HR EMPLOYEES HR EMPLOYEES P
HR EMPLOYEES HR JOBS P
HR JOBS
HR JOB_HISTORY HR DEPARTMENTS P
HR JOB_HISTORY HR EMPLOYEES P
HR JOB_HISTORY HR JOBS P
HR LOCATIONS HR COUNTRIES P
HR REGIONS

 

Finally, we can start from the rows where R_OWNER is null and go down the hierarchy:

with pur as (
  select OWNER, CONSTRAINT_NAME, CONSTRAINT_TYPE, TABLE_NAME,
    R_OWNER, R_CONSTRAINT_NAME
  from dba_constraints
  where constraint_type in('P','U','R')
  and owner in (select username from dba_users where oracle_maintained = 'N')
)
, relations as (
  select a.owner, a.table_name, a.r_owner,
    b.table_name r_table_name,
    b.constraint_type r_constraint_type
  from pur a join pur b
  on (a.R_OWNER, a.R_CONSTRAINT_NAME) = ((b.OWNER, b.CONSTRAINT_NAME))
  -- and (a.OWNER, a.table_name) != ((b.OWNER, b.table_name))
)
, with_parents as (
  select * from relations
  union
  select r_owner, r_table_name, null, null, null
  from relations 
  where (r_owner, r_table_name) not in (
    select owner, table_name
    from relations
    where (owner, table_name) != ((r_owner, r_table_name))
  )
)
select level LVL, OWNER, TABLE_NAME, R_OWNER, R_TABLE_NAME, R_CONSTRAINT_TYPE,
  connect_by_iscycle is_cycle
from with_parents
start with r_owner is null
connect by nocycle (r_owner, r_table_name) = ((prior owner, prior table_name))
order siblings by owner, table_name;
LVL OWNER TABLE_NAME R_OWNER R_TABLE_NAME R_CONSTRAINT_TYPE IS_CYCLE
1 HR JOBS 0
2 HR EMPLOYEES HR JOBS P 1
3 HR DEPARTMENTS HR EMPLOYEES P 1
4 HR JOB_HISTORY HR DEPARTMENTS P 0
3 HR JOB_HISTORY HR EMPLOYEES P 0
2 HR JOB_HISTORY HR JOBS P 0
1 HR REGIONS 0
2 HR COUNTRIES HR REGIONS P 0
3 HR LOCATIONS HR COUNTRIES P 0
4 HR DEPARTMENTS HR LOCATIONS P 0
5 HR EMPLOYEES HR DEPARTMENTS P 1
6 HR JOB_HISTORY HR EMPLOYEES P 0
5 HR JOB_HISTORY HR DEPARTMENTS P 0

 

Hope this helps…

Generic Pivot Function

Many folks ask how to do a “dynamic pivot”. It’s often best to use a preliminary query to generate the PIVOT clause. Here’s a generic function to do just that.

[UPDATE 2018-07-31: for advanced use cases, I wrote a more powerful function called ADVANCED_PIVOT, which I explain here: Improved PIVOT Function]

Anatomy of PIVOT

PIVOT clause from Oracle documentation

Let’s take a simple example, getting the average salary by department and by job:

select dname, job, avg(sal) avg_sal
from DEPT join EMP using(DEPTNO)
group by dname, job
order by 1,2;
DNAME JOB AVG_SAL
ACCOUNTING CLERK 1300
ACCOUNTING MANAGER 2450
ACCOUNTING PRESIDENT 5000
RESEARCH ANALYST 3000
RESEARCH CLERK 950
RESEARCH MANAGER 2975
SALES CLERK 950
SALES MANAGER 2850
SALES SALESMAN 1400

 

We want three rows, one per department, and five “pivot columns” for each of the five distinct jobs. The JOB column data become pivot column names and the SAL column data become pivot column data. The desired result is:

DNAME ANALYST CLERK MANAGER PRESIDENT SALESMAN
ACCOUNTING 1300 2450 5000
RESEARCH 3000 950 2975
SALES 950 2850 1400

 

A PIVOT clause needs:

  1. One row source for input: a table, view or query in parentheses. Each column in the input must have a unique name. The row source does not need to do a GROUP BY: the PIVOT clause will do the aggregation.
  2. One or more aggregate functions: they produce the data for the pivot columns.
  3. a PIVOT_FOR clause with one or more columns whose values will correspond to new column names.
  4. a PIVOT_IN clause that lists the pivot columns with their source values.
select * from (
  select dname, job, sal from DEPT join EMP using(DEPTNO)
)
pivot(
  AVG(SAL)
  for JOB
  in (
    'ANALYST' as "ANALYST",
    'CLERK' as "CLERK",
    'MANAGER' as "MANAGER",
    'PRESIDENT' as "PRESIDENT",
    'SALESMAN' as "SALESMAN"
  )
) order by 1
  • line 2 is the input.
  • line 5 is the aggregate function.
    There will be an implicit GROUP BY of DNAME (because not mentioned in the PIVOT) and JOB (because mentioned in the PIVOT_FOR clause).
  • line 6 is the PIVOT_FOR clause: JOB
  • lines 7 through 13 are the PIVOT_IN clause.
    In line 8, a pivot column called ANALYST is defined for when JOB = ‘ANALYST’. The result of the AVG(SAL) function for DNAME = ‘RESEARCH’ and JOB = ‘ANALYST’ will go into the ‘RESEARCH’ row and the ANALYST column.

A Generic PIVOT Template

I like to use what I call “SQL templates“. They are a way of describing a pattern: you distinguish what varies each time from what stays the same, and you put placeholders where the variable stuff should go. For PIVOT, the template has four placeholders.

select * from #SOURCE#
pivot(#AGGFUNCS# for #PIVOTFOR# in (
#PIVOTINLIST#
)) order by 1,2
  1. #SOURCE# is the row source: table, view, query in parentheses.
  2. #AGGFUNCS# is one or more aggregate functions separated by commas, with or without aliases.
  3. #PIVOTFOR# must contain exactly one column name from the source. This is a limitation of my function.
  4. #PIVOTINLIST# lists the pivot columns.

My function gets the first three placeholders from the input and generates the fourth one.

Generating the value for #PIVOTINLIST#

The problem with PIVOT is that the list of pivot columns depends on the data, so we need to know the data before querying it! In our example, we list five column names for the five JOBs in the EMP table, but how do we know what those jobs are?

My function uses a preliminary query to get that list. It has its own template using two of the same placeholders:

q'£select listagg('''' || #PIVOTFOR# || ''' as "' || #PIVOTFOR# || '"', ',')
within group(order by #PIVOTFOR#)
from (select distinct #PIVOTFOR# from #SOURCE#)£'

The Code

create or replace function generic_pivot(
  p_source in varchar2,   -- table, view or query in parentheses
  p_pivotfor in varchar2, -- one column from the input
  p_aggfuncs in varchar2  -- one or more aggregation functions
) return sys_refcursor is

-- Calculates pivot_in_list using SQL 1, updates SQL 2 text and opens ref cursor.
-- Pivot_for_clause can contain one column

  l_sql sys.odcivarchar2list := sys.odcivarchar2list(

q'£select listagg('''' || #PIVOTFOR# || ''' as "' || #PIVOTFOR# || '"', ',')
within group(order by #PIVOTFOR#)
from (select distinct #PIVOTFOR# from #SOURCE#)£',

'select * from #SOURCE#
pivot(#AGGFUNCS# for #PIVOTFOR# in (
#PIVOTINLIST#
)) order by 1,2'
  );
  l_refcur sys_refcursor;
  l_pivotinlist varchar2(32767);
begin
  for i in 1..l_sql.count loop
    l_sql(i) := replace(l_sql(i), '#SOURCE#', p_source);
    l_sql(i) := replace(l_sql(i), '#PIVOTFOR#', p_pivotfor);
  end loop;
  dbms_output.put_line(l_sql(1));
  dbms_output.put_line('/');
  open l_refcur for l_sql(1);
  fetch l_refcur into l_pivotinlist;
  close l_refcur;
  l_sql(2) := replace(l_sql(2), '#AGGFUNCS#', p_aggfuncs);
  l_sql(2) := replace(l_sql(2), '#PIVOTINLIST#', l_pivotinlist);
  dbms_output.put_line(l_sql(2));
  dbms_output.put_line('/');
  open l_refcur for l_sql(2);
  return l_refcur;
end generic_pivot;
/

Test results

SQL> set serveroutput off
SQL> var rc refcursor
SQL> begin
  2    :rc := generic_pivot(
  3      '(select dname, job, sal from DEPT join EMP using(DEPTNO))', 
  4      'JOB', 
  5      'AVG(SAL)'
  6    );
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL> print :rc

DNAME             ANALYST      CLERK    MANAGER  PRESIDENT   SALESMAN
-------------- ---------- ---------- ---------- ---------- ----------
ACCOUNTING                      1300       2450       5000           
RESEARCH             3000        950       2975                      
SALES                            950       2850                  1400


SQL> begin
  2    :rc := generic_pivot(
  3      '(select dname, job, sal from DEPT join EMP using(DEPTNO))', 
  4      'DNAME', 
  5      'AVG(SAL)'
  6    );
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL> print :rc

JOB       ACCOUNTING   RESEARCH      SALES
--------- ---------- ---------- ----------
ANALYST                    3000           
CLERK           1300        950        950
MANAGER         2450       2975       2850
PRESIDENT       5000                      
SALESMAN                              1400


SQL> begin
  2    :rc := generic_pivot(
  3      '(select dname, job, sal from DEPT join EMP using(DEPTNO))', 
  4      'JOB', 
  5      'MIN(SAL) min, MAX(SAL) max'
  6    );
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL> print :rc

DNAME          ANALYST_MIN ANALYST_MAX  CLERK_MIN  CLERK_MAX MANAGER_MIN MANAGER_MAX PRESIDENT_MIN PRESIDENT_MAX SALESMAN_MIN SALESMAN_MAX
-------------- ----------- ----------- ---------- ---------- ----------- ----------- ------------- ------------- ------------ ------------
ACCOUNTING                                   1300       1300        2450        2450          5000          5000                          
RESEARCH              3000        3000        800       1100        2975        2975                                                      
SALES                                         950        950        2850        2850                                     1250         1600

Hope this helps!

More on Comparing Columns across Rows

My previous post got several replies about other ways to compare columns across rows. Here’s my take on all of them.

Test data (again)

create table t as
with data as (
  select level n from dual connect by level <= 3
)
, tests as (
  select row_number() over (order by a.n, b.n) test_id,
  nullif(a.n, 3) a, 
  nullif(b.n, 3) b,
  case a.n when b.n then 'Same' else 'Different' end status
  from data a, data b
)
select test_id, tbl, n, status
from tests
unpivot include nulls (n for tbl in(A,B));

select * from t order by 1,2;
TEST_ID TBL N STATUS
1 A 1 Same
1 B 1 Same
2 A 1 Different
2 B 2 Different
3 A 1 Different
3 B Different
4 A 2 Different
4 B 1 Different
5 A 2 Same
5 B 2 Same
6 A 2 Different
6 B Different
7 A Different
7 B 1 Different
8 A Different
8 B 2 Different
9 A Same
9 B Same

 

My solution (again)

It looks like the most concise test for a difference is COUNT(N) BETWEEN 1 and COUNT(DISTINCT N).

select t.*,
count(distinct n) over(partition by test_id) cnt_distinct,
count(n) over(partition by test_id) cnt,
case when count(n) over(partition by test_id)
  between 1 and count(distinct n) over(partition by test_id)
  then 'Different' else 'Same' end new_status
from t
order by 4,5,6,1,2;
TEST_ID TBL N STATUS CNT_DISTINCT CNT NEW_STATUS
3 A 1 Different 1 1 Different
3 B Different 1 1 Different
6 A 2 Different 1 1 Different
6 B Different 1 1 Different
7 A Different 1 1 Different
7 B 1 Different 1 1 Different
8 A Different 1 1 Different
8 B 2 Different 1 1 Different
2 A 1 Different 2 2 Different
2 B 2 Different 2 2 Different
4 A 2 Different 2 2 Different
4 B 1 Different 2 2 Different
9 A Same 0 0 Same
9 B Same 0 0 Same
1 A 1 Same 1 2 Same
1 B 1 Same 1 2 Same
5 A 2 Same 1 2 Same
5 B 2 Same 1 2 Same

 

Using the DUMP() function (Andrej Pashchenko)

Since the DUMP() function returns a non-null result in every case, we can simply test

COUNT( DISTINCT DUMP(n) ) OVER (PARTITION BY TEST_ID)

If the result is 1, N is the same, otherwise it is different.

This solution is more concise, but in my tests it ran 10 times slower. The DUMP() function chews up some CPU.

Using SYS_OP_MAP_NONNULL() (Tony Hasler)

As with DUMP(), we can test

COUNT( DISTINCT SYS_OP_MAP_NONNULL(n) ) OVER (PARTITION BY TEST_ID)

Again a concise solution. Surprisingly, it runs twice as slow as my solution – and despite one mention in the documentation I doubt that this function is supported.

Simplifying the comparison

Zahar Hilkevich suggested simply comparing the two counts:

select t.*,
count(distinct n) over(partition by test_id) cnt_distinct,
count(n) over(partition by test_id) cnt,
case when count(n)          over(partition by test_id)
        = count(distinct n) over(partition by test_id)
  then 'Different' else 'Same' end new_status
from t
order by 4,5,6,1,2;
TEST_ID TBL N STATUS CNT_DISTINCT CNT NEW_STATUS
3 A 1 Different 1 1 Different
3 B Different 1 1 Different
6 A 2 Different 1 1 Different
6 B Different 1 1 Different
7 A Different 1 1 Different
7 B 1 Different 1 1 Different
8 A Different 1 1 Different
8 B 2 Different 1 1 Different
2 A 1 Different 2 2 Different
2 B 2 Different 2 2 Different
4 A 2 Different 2 2 Different
4 B 1 Different 2 2 Different
9 A Same 0 0 Different
9 B Same 0 0 Different
1 A 1 Same 1 2 Same
1 B 1 Same 1 2 Same
5 A 2 Same 1 2 Same
5 B 2 Same 1 2 Same

 

Notice that in test case 9, when both values are null, Zahar’s test says they are “different”. This is not a problem in the context of displaying different values, since the output values will be null no matter what.

Conclusion

Thanks to those who mentioned other more concise solutions. Despite my taste for concision, I think the solution I proposed is correct and efficient. The choice is yours!

Actually Seeing the Differences

When comparing data between two tables, it’s one thing to query the differences and another thing to actually see them.

I have two tables EMP and EMP2. Here are the rows that are different. Quick, what columns have changed?

EMPNO TBL ROW_CNT ENAME JOB MGR SAL COMM
7369 EMP 2 SMITH CLERK 7902 800
7369 EMP2 2 SMITE CLERK 7902 800
7499 EMP 2 ALLEN SALESMAN 7698 1600 300
7499 EMP2 2 ALLEN SALESGUY 7698 1600 300
7521 EMP 2 WARD SALESMAN 7698 1250 500
7521 EMP2 2 WARD SALESMAN 7788 1250 500
7654 EMP 2 MARTIN SALESMAN 7698 1250 1400
7654 EMP2 2 MARTIN SALESMAN 7698 1750 1400
7698 EMP 2 BLAKE MANAGER 7839 2850
7698 EMP2 2 BLAKE MANAGER 7839 2850 1000
7788 EMP2 1 SCOTT ANALYST 7566 3000
7902 EMP 1 FORD ANALYST 7566 3000

 

I thought so. Now suppose I blank out the columns that are the same in both tables?

EMPNO TBL ROW_CNT ENAME JOB MGR SAL COMM
7369 EMP 2 SMITH
7369 EMP2 2 SMITE
7499 EMP 2 SALESMAN
7499 EMP2 2 SALESGUY
7521 EMP 2 7698
7521 EMP2 2 7788
7654 EMP 2 1250
7654 EMP2 2 1750
7698 EMP 2
7698 EMP2 2 1000
7788 EMP2 1 SCOTT ANALYST 7566 3000
7902 EMP 1 FORD ANALYST 7566 3000

 

That’s better. Now, how can I do that?

Comparing columns across rows

I need to compare two columns in two different rows, and I need the result of the comparison in each row. That sounds like a job for analytic functions. It would be hard to use LAG() or LEAD() because one row would need LEAD() and the other would need LAG(). I finally came up with a way to use COUNT().

For testing, I created a little table T:

TEST_ID TBL N STATUS
1 A 1 Same
1 B 1 Same
2 A 1 Different
2 B 2 Different
3 A 1 Different
3 B Different
4 A 2 Different
4 B 1 Different
5 A 2 Same
5 B 2 Same
6 A 2 Different
6 B Different
7 A Different
7 B 1 Different
8 A Different
8 B 2 Different
9 A Same
9 B Same

 

If I use COUNT(DISTINCT N) I should get either 1 (same values) or 2 (different values) and I’m done: wrong! When I count N or DISTINCT N, null values don’t count. So I thought of comparing COUNT(N) and COUNT(DISTINCT N).

select t.*,
count(distinct n) over(partition by test_id) cnt_distinct,
count(n) over(partition by test_id) cnt
from t
order by 4,5,6,1,2;
TEST_ID TBL N STATUS CNT_DISTINCT CNT
3 A 1 Different 1 1
3 B Different 1 1
6 A 2 Different 1 1
6 B Different 1 1
7 A Different 1 1
7 B 1 Different 1 1
8 A Different 1 1
8 B 2 Different 1 1
2 A 1 Different 2 2
2 B 2 Different 2 2
4 A 2 Different 2 2
4 B 1 Different 2 2
9 A Same 0 0
9 B Same 0 0
1 A 1 Same 1 2
1 B 1 Same 1 2
5 A 2 Same 1 2
5 B 2 Same 1 2

 

It looks like the most concise test for a difference is COUNT(N) BETWEEN 1 and COUNT(DISTINCT N).

select t.*,
count(distinct n) over(partition by test_id) cnt_distinct,
count(n) over(partition by test_id) cnt,
case when count(n) over(partition by test_id)
  between 1 and count(distinct n) over(partition by test_id)
  then 'Different' else 'Same' end new_status
from t
order by 4,5,6,1,2;
TEST_ID TBL N STATUS CNT_DISTINCT CNT NEW_STATUS
3 A 1 Different 1 1 Different
3 B Different 1 1 Different
6 A 2 Different 1 1 Different
6 B Different 1 1 Different
7 A Different 1 1 Different
7 B 1 Different 1 1 Different
8 A Different 1 1 Different
8 B 2 Different 1 1 Different
2 A 1 Different 2 2 Different
2 B 2 Different 2 2 Different
4 A 2 Different 2 2 Different
4 B 1 Different 2 2 Different
9 A Same 0 0 Same
9 B Same 0 0 Same
1 A 1 Same 1 2 Same
1 B 1 Same 1 2 Same
5 A 2 Same 1 2 Same
5 B 2 Same 1 2 Same

 

Finally, I apply this technique to each non-PK column in my EMP comparison and I get the desired result.

select EMPNO,
  case NEW_CNT when 1 then 'EMP2' else 'EMP' end tbl,
  ROW_CNT,
  case when count(ENAME) over(partition by EMPNO)
    between 1 and count(distinct ENAME) over(partition by EMPNO)
    then ENAME end ENAME,
  case when count(JOB) over(partition by EMPNO)
    between 1 and count(distinct JOB) over(partition by EMPNO)
    then JOB end JOB,
  case when count(MGR) over(partition by EMPNO)
    between 1 and count(distinct MGR) over(partition by EMPNO)
    then MGR end MGR,
  case when count(SAL) over(partition by EMPNO)
    between 1 and count(distinct SAL) over(partition by EMPNO)
    then SAL end SAL,
  case when count(COMM) over(partition by EMPNO)
    between 1 and count(distinct COMM) over(partition by EMPNO)
    then COMM end COMM
FROM (
  select
    EMPNO, ENAME, JOB, MGR, SAL, COMM,
    sum(NEW_CNT) NEW_CNT, count(*) over(partition by EMPNO) ROW_CNT
  FROM (
    select 
    EMPNO, ENAME, JOB, MGR, SAL, COMM,
    -1 NEW_CNT
    from EMP O
    union all
    select
    EMPNO, ENAME, JOB, MGR, SAL, COMM,
    1 NEW_CNT
    from emp2 N
  )
  group by
    EMPNO, ENAME, JOB, MGR, SAL, COMM
  having sum(NEW_CNT) != 0
)
order by 1, 2, new_cnt;

Conclusion

By using two COUNT() analytic functions, I can tell whether two columns in two different rows are the same or not, considering two NULLs to be “the same”. This allows me to compare rows, then to compare columns and blank out all but the true differences.