Actually Seeing the Differences

When comparing data between two tables, it’s one thing to query the differences and another thing to actually see them.

I have two tables EMP and EMP2. Here are the rows that are different. Quick, what columns have changed?

EMPNO TBL ROW_CNT ENAME JOB MGR SAL COMM
7369 EMP 2 SMITH CLERK 7902 800
7369 EMP2 2 SMITE CLERK 7902 800
7499 EMP 2 ALLEN SALESMAN 7698 1600 300
7499 EMP2 2 ALLEN SALESGUY 7698 1600 300
7521 EMP 2 WARD SALESMAN 7698 1250 500
7521 EMP2 2 WARD SALESMAN 7788 1250 500
7654 EMP 2 MARTIN SALESMAN 7698 1250 1400
7654 EMP2 2 MARTIN SALESMAN 7698 1750 1400
7698 EMP 2 BLAKE MANAGER 7839 2850
7698 EMP2 2 BLAKE MANAGER 7839 2850 1000
7788 EMP2 1 SCOTT ANALYST 7566 3000
7902 EMP 1 FORD ANALYST 7566 3000

 

I thought so. Now suppose I blank out the columns that are the same in both tables?

EMPNO TBL ROW_CNT ENAME JOB MGR SAL COMM
7369 EMP 2 SMITH
7369 EMP2 2 SMITE
7499 EMP 2 SALESMAN
7499 EMP2 2 SALESGUY
7521 EMP 2 7698
7521 EMP2 2 7788
7654 EMP 2 1250
7654 EMP2 2 1750
7698 EMP 2
7698 EMP2 2 1000
7788 EMP2 1 SCOTT ANALYST 7566 3000
7902 EMP 1 FORD ANALYST 7566 3000

 

That’s better. Now, how can I do that?

Comparing columns across rows

I need to compare two columns in two different rows, and I need the result of the comparison in each row. That sounds like a job for analytic functions. It would be hard to use LAG() or LEAD() because one row would need LEAD() and the other would need LAG(). I finally came up with a way to use COUNT().

For testing, I created a little table T:

TEST_ID TBL N STATUS
1 A 1 Same
1 B 1 Same
2 A 1 Different
2 B 2 Different
3 A 1 Different
3 B Different
4 A 2 Different
4 B 1 Different
5 A 2 Same
5 B 2 Same
6 A 2 Different
6 B Different
7 A Different
7 B 1 Different
8 A Different
8 B 2 Different
9 A Same
9 B Same

 

If I use COUNT(DISTINCT N) I should get either 1 (same values) or 2 (different values) and I’m done: wrong! When I count N or DISTINCT N, null values don’t count. So I thought of comparing COUNT(N) and COUNT(DISTINCT N).

select t.*,
count(distinct n) over(partition by test_id) cnt_distinct,
count(n) over(partition by test_id) cnt
from t
order by 4,5,6,1,2;
TEST_ID TBL N STATUS CNT_DISTINCT CNT
3 A 1 Different 1 1
3 B Different 1 1
6 A 2 Different 1 1
6 B Different 1 1
7 A Different 1 1
7 B 1 Different 1 1
8 A Different 1 1
8 B 2 Different 1 1
2 A 1 Different 2 2
2 B 2 Different 2 2
4 A 2 Different 2 2
4 B 1 Different 2 2
9 A Same 0 0
9 B Same 0 0
1 A 1 Same 1 2
1 B 1 Same 1 2
5 A 2 Same 1 2
5 B 2 Same 1 2

 

It looks like the most concise test for a difference is COUNT(N) BETWEEN 1 and COUNT(DISTINCT N).

select t.*,
count(distinct n) over(partition by test_id) cnt_distinct,
count(n) over(partition by test_id) cnt,
case when count(n) over(partition by test_id)
  between 1 and count(distinct n) over(partition by test_id)
  then 'Different' else 'Same' end new_status
from t
order by 4,5,6,1,2;
TEST_ID TBL N STATUS CNT_DISTINCT CNT NEW_STATUS
3 A 1 Different 1 1 Different
3 B Different 1 1 Different
6 A 2 Different 1 1 Different
6 B Different 1 1 Different
7 A Different 1 1 Different
7 B 1 Different 1 1 Different
8 A Different 1 1 Different
8 B 2 Different 1 1 Different
2 A 1 Different 2 2 Different
2 B 2 Different 2 2 Different
4 A 2 Different 2 2 Different
4 B 1 Different 2 2 Different
9 A Same 0 0 Same
9 B Same 0 0 Same
1 A 1 Same 1 2 Same
1 B 1 Same 1 2 Same
5 A 2 Same 1 2 Same
5 B 2 Same 1 2 Same

 

Finally, I apply this technique to each non-PK column in my EMP comparison and I get the desired result.

select EMPNO,
  case NEW_CNT when 1 then 'EMP2' else 'EMP' end tbl,
  ROW_CNT,
  case when count(ENAME) over(partition by EMPNO)
    between 1 and count(distinct ENAME) over(partition by EMPNO)
    then ENAME end ENAME,
  case when count(JOB) over(partition by EMPNO)
    between 1 and count(distinct JOB) over(partition by EMPNO)
    then JOB end JOB,
  case when count(MGR) over(partition by EMPNO)
    between 1 and count(distinct MGR) over(partition by EMPNO)
    then MGR end MGR,
  case when count(SAL) over(partition by EMPNO)
    between 1 and count(distinct SAL) over(partition by EMPNO)
    then SAL end SAL,
  case when count(COMM) over(partition by EMPNO)
    between 1 and count(distinct COMM) over(partition by EMPNO)
    then COMM end COMM
FROM (
  select
    EMPNO, ENAME, JOB, MGR, SAL, COMM,
    sum(NEW_CNT) NEW_CNT, count(*) over(partition by EMPNO) ROW_CNT
  FROM (
    select 
    EMPNO, ENAME, JOB, MGR, SAL, COMM,
    -1 NEW_CNT
    from EMP O
    union all
    select
    EMPNO, ENAME, JOB, MGR, SAL, COMM,
    1 NEW_CNT
    from emp2 N
  )
  group by
    EMPNO, ENAME, JOB, MGR, SAL, COMM
  having sum(NEW_CNT) != 0
)
order by 1, 2, new_cnt;

Conclusion

By using two COUNT() analytic functions, I can tell whether two columns in two different rows are the same or not, considering two NULLs to be “the same”. This allows me to compare rows, then to compare columns and blank out all but the true differences.

Advertisement

7 thoughts on “Actually Seeing the Differences

  1. This is a nice technique. I wonder, wouldn’t it be easier to simply compare COUNT(DISTINCT ) to 1 and 2. If 1 then – ‘the same’ if 2 – then different.

  2. It seems that BETWEEN is not really needed in your query, simple match of COUNT to COUNT(DISTINCT …) is sufficient:


    WITH x AS (
    SELECT empno, ename, job, mgr, sal, comm, 'EMP' tbl
    FROM emp
    UNION ALL
    SELECT empno, ename, job, mgr, sal, comm, 'EMP2' tbl
    FROM emp2
    )
    SELECT empno, MAX(tbl) tbl,
    COUNT(*) OVER(PARTITION BY empno) row_cnt,
    CASE WHEN COUNT(DISTINCT ename) OVER(PARTITION BY empno)=COUNT(ename) OVER(PARTITION BY empno)
    THEN ename
    END ename,
    CASE WHEN COUNT(DISTINCT job) OVER(PARTITION BY empno)=COUNT(job) OVER(PARTITION BY empno)
    THEN job
    END job,
    CASE WHEN COUNT(DISTINCT mgr) OVER(PARTITION BY empno)=COUNT(mgr) OVER(PARTITION BY empno)
    THEN mgr
    END mgr,
    CASE WHEN COUNT(DISTINCT sal) OVER(PARTITION BY empno)=COUNT(sal) OVER(PARTITION BY empno)
    THEN sal
    END sal,
    CASE WHEN COUNT(DISTINCT comm) OVER(PARTITION BY empno)=COUNT(comm) OVER(PARTITION BY empno)
    THEN comm
    END comm
    FROM x
    GROUP BY empno, ename, job, mgr, sal, comm
    HAVING COUNT(*)=1
    ORDER BY 1, 2

    • Zahar,

      Yes, I see your point. If both values are NULL, your WHEN condition is met and mine is not, but that doesn’t matter because both values are NULL…

      Best regards,
      Stew

  3. WITH x AS (
    SELECT  empno, ename, job, mgr, sal, comm, 'EMP' tbl
    FROM emp 
    UNION ALL
    SELECT empno, ename, job, mgr, sal, comm, 'EMP2' tbl
    FROM emp2 
    )
    SELECT empno, MAX(tbl) tbl,
           COUNT(*) OVER(PARTITION BY empno) row_cnt,
           CASE WHEN COUNT(DISTINCT ename) OVER(PARTITION BY empno)=COUNT(ename) OVER(PARTITION BY empno)
                THEN ename 
           END ename,
           CASE WHEN COUNT(DISTINCT job) OVER(PARTITION BY empno)=COUNT(job) OVER(PARTITION BY empno)
                THEN job 
           END job,
           CASE WHEN COUNT(DISTINCT mgr) OVER(PARTITION BY empno)=COUNT(mgr) OVER(PARTITION BY empno)
                THEN mgr 
           END mgr,
           CASE WHEN COUNT(DISTINCT sal) OVER(PARTITION BY empno)=COUNT(sal) OVER(PARTITION BY empno)
                THEN sal 
           END sal,
           CASE WHEN COUNT(DISTINCT comm) OVER(PARTITION BY empno)=COUNT(comm) OVER(PARTITION BY empno)
                THEN comm 
           END comm
    FROM x
    GROUP BY empno, ename, job, mgr, sal, comm
    HAVING COUNT(*)=1
    ORDER BY 1, 2
    
    • Hi Tony,

      If SYS_OP_MAP_NONNULL were supported, we could just do

      COUNT(DISTINCT SYS_OP_MAP_NONNULL(n)) OVER (PARTITION BY test_id)

      which would be the most concise solution, though strangely the solution I posted is twice as fast in the few tests I did: it took less CPU.

      Best regards,
      Stew

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s