When comparing data between two tables, it’s one thing to query the differences and another thing to actually see them.
I have two tables EMP and EMP2. Here are the rows that are different. Quick, what columns have changed?
EMPNO | TBL | ROW_CNT | ENAME | JOB | MGR | SAL | COMM |
---|---|---|---|---|---|---|---|
7369 | EMP | 2 | SMITH | CLERK | 7902 | 800 | |
7369 | EMP2 | 2 | SMITE | CLERK | 7902 | 800 | |
7499 | EMP | 2 | ALLEN | SALESMAN | 7698 | 1600 | 300 |
7499 | EMP2 | 2 | ALLEN | SALESGUY | 7698 | 1600 | 300 |
7521 | EMP | 2 | WARD | SALESMAN | 7698 | 1250 | 500 |
7521 | EMP2 | 2 | WARD | SALESMAN | 7788 | 1250 | 500 |
7654 | EMP | 2 | MARTIN | SALESMAN | 7698 | 1250 | 1400 |
7654 | EMP2 | 2 | MARTIN | SALESMAN | 7698 | 1750 | 1400 |
7698 | EMP | 2 | BLAKE | MANAGER | 7839 | 2850 | |
7698 | EMP2 | 2 | BLAKE | MANAGER | 7839 | 2850 | 1000 |
7788 | EMP2 | 1 | SCOTT | ANALYST | 7566 | 3000 | |
7902 | EMP | 1 | FORD | ANALYST | 7566 | 3000 |
I thought so. Now suppose I blank out the columns that are the same in both tables?
EMPNO | TBL | ROW_CNT | ENAME | JOB | MGR | SAL | COMM |
---|---|---|---|---|---|---|---|
7369 | EMP | 2 | SMITH | ||||
7369 | EMP2 | 2 | SMITE | ||||
7499 | EMP | 2 | SALESMAN | ||||
7499 | EMP2 | 2 | SALESGUY | ||||
7521 | EMP | 2 | 7698 | ||||
7521 | EMP2 | 2 | 7788 | ||||
7654 | EMP | 2 | 1250 | ||||
7654 | EMP2 | 2 | 1750 | ||||
7698 | EMP | 2 | |||||
7698 | EMP2 | 2 | 1000 | ||||
7788 | EMP2 | 1 | SCOTT | ANALYST | 7566 | 3000 | |
7902 | EMP | 1 | FORD | ANALYST | 7566 | 3000 |
That’s better. Now, how can I do that?
Comparing columns across rows
I need to compare two columns in two different rows, and I need the result of the comparison in each row. That sounds like a job for analytic functions. It would be hard to use LAG() or LEAD() because one row would need LEAD() and the other would need LAG(). I finally came up with a way to use COUNT().
For testing, I created a little table T:
TEST_ID | TBL | N | STATUS |
---|---|---|---|
1 | A | 1 | Same |
1 | B | 1 | Same |
2 | A | 1 | Different |
2 | B | 2 | Different |
3 | A | 1 | Different |
3 | B | Different | |
4 | A | 2 | Different |
4 | B | 1 | Different |
5 | A | 2 | Same |
5 | B | 2 | Same |
6 | A | 2 | Different |
6 | B | Different | |
7 | A | Different | |
7 | B | 1 | Different |
8 | A | Different | |
8 | B | 2 | Different |
9 | A | Same | |
9 | B | Same |
If I use COUNT(DISTINCT N) I should get either 1 (same values) or 2 (different values) and I’m done: wrong! When I count N or DISTINCT N, null values don’t count. So I thought of comparing COUNT(N) and COUNT(DISTINCT N).
select t.*, count(distinct n) over(partition by test_id) cnt_distinct, count(n) over(partition by test_id) cnt from t order by 4,5,6,1,2;
TEST_ID | TBL | N | STATUS | CNT_DISTINCT | CNT |
---|---|---|---|---|---|
3 | A | 1 | Different | 1 | 1 |
3 | B | Different | 1 | 1 | |
6 | A | 2 | Different | 1 | 1 |
6 | B | Different | 1 | 1 | |
7 | A | Different | 1 | 1 | |
7 | B | 1 | Different | 1 | 1 |
8 | A | Different | 1 | 1 | |
8 | B | 2 | Different | 1 | 1 |
2 | A | 1 | Different | 2 | 2 |
2 | B | 2 | Different | 2 | 2 |
4 | A | 2 | Different | 2 | 2 |
4 | B | 1 | Different | 2 | 2 |
9 | A | Same | 0 | 0 | |
9 | B | Same | 0 | 0 | |
1 | A | 1 | Same | 1 | 2 |
1 | B | 1 | Same | 1 | 2 |
5 | A | 2 | Same | 1 | 2 |
5 | B | 2 | Same | 1 | 2 |
It looks like the most concise test for a difference is COUNT(N) BETWEEN 1 and COUNT(DISTINCT N).
select t.*, count(distinct n) over(partition by test_id) cnt_distinct, count(n) over(partition by test_id) cnt, case when count(n) over(partition by test_id) between 1 and count(distinct n) over(partition by test_id) then 'Different' else 'Same' end new_status from t order by 4,5,6,1,2;
TEST_ID | TBL | N | STATUS | CNT_DISTINCT | CNT | NEW_STATUS |
---|---|---|---|---|---|---|
3 | A | 1 | Different | 1 | 1 | Different |
3 | B | Different | 1 | 1 | Different | |
6 | A | 2 | Different | 1 | 1 | Different |
6 | B | Different | 1 | 1 | Different | |
7 | A | Different | 1 | 1 | Different | |
7 | B | 1 | Different | 1 | 1 | Different |
8 | A | Different | 1 | 1 | Different | |
8 | B | 2 | Different | 1 | 1 | Different |
2 | A | 1 | Different | 2 | 2 | Different |
2 | B | 2 | Different | 2 | 2 | Different |
4 | A | 2 | Different | 2 | 2 | Different |
4 | B | 1 | Different | 2 | 2 | Different |
9 | A | Same | 0 | 0 | Same | |
9 | B | Same | 0 | 0 | Same | |
1 | A | 1 | Same | 1 | 2 | Same |
1 | B | 1 | Same | 1 | 2 | Same |
5 | A | 2 | Same | 1 | 2 | Same |
5 | B | 2 | Same | 1 | 2 | Same |
Finally, I apply this technique to each non-PK column in my EMP comparison and I get the desired result.
select EMPNO, case NEW_CNT when 1 then 'EMP2' else 'EMP' end tbl, ROW_CNT, case when count(ENAME) over(partition by EMPNO) between 1 and count(distinct ENAME) over(partition by EMPNO) then ENAME end ENAME, case when count(JOB) over(partition by EMPNO) between 1 and count(distinct JOB) over(partition by EMPNO) then JOB end JOB, case when count(MGR) over(partition by EMPNO) between 1 and count(distinct MGR) over(partition by EMPNO) then MGR end MGR, case when count(SAL) over(partition by EMPNO) between 1 and count(distinct SAL) over(partition by EMPNO) then SAL end SAL, case when count(COMM) over(partition by EMPNO) between 1 and count(distinct COMM) over(partition by EMPNO) then COMM end COMM FROM ( select EMPNO, ENAME, JOB, MGR, SAL, COMM, sum(NEW_CNT) NEW_CNT, count(*) over(partition by EMPNO) ROW_CNT FROM ( select EMPNO, ENAME, JOB, MGR, SAL, COMM, -1 NEW_CNT from EMP O union all select EMPNO, ENAME, JOB, MGR, SAL, COMM, 1 NEW_CNT from emp2 N ) group by EMPNO, ENAME, JOB, MGR, SAL, COMM having sum(NEW_CNT) != 0 ) order by 1, 2, new_cnt;
Conclusion
By using two COUNT() analytic functions, I can tell whether two columns in two different rows are the same or not, considering two NULLs to be “the same”. This allows me to compare rows, then to compare columns and blank out all but the true differences.
This is a nice technique. I wonder, wouldn’t it be easier to simply compare COUNT(DISTINCT ) to 1 and 2. If 1 then – ‘the same’ if 2 – then different.
Hi Zahar,
That doesn’t work when one or both of the values is NULL. I did explain that in my post…
Best regards,
Stew
It seems that BETWEEN is not really needed in your query, simple match of COUNT to COUNT(DISTINCT …) is sufficient:
WITH x AS (
SELECT empno, ename, job, mgr, sal, comm, 'EMP' tbl
FROM emp
UNION ALL
SELECT empno, ename, job, mgr, sal, comm, 'EMP2' tbl
FROM emp2
)
SELECT empno, MAX(tbl) tbl,
COUNT(*) OVER(PARTITION BY empno) row_cnt,
CASE WHEN COUNT(DISTINCT ename) OVER(PARTITION BY empno)=COUNT(ename) OVER(PARTITION BY empno)
THEN ename
END ename,
CASE WHEN COUNT(DISTINCT job) OVER(PARTITION BY empno)=COUNT(job) OVER(PARTITION BY empno)
THEN job
END job,
CASE WHEN COUNT(DISTINCT mgr) OVER(PARTITION BY empno)=COUNT(mgr) OVER(PARTITION BY empno)
THEN mgr
END mgr,
CASE WHEN COUNT(DISTINCT sal) OVER(PARTITION BY empno)=COUNT(sal) OVER(PARTITION BY empno)
THEN sal
END sal,
CASE WHEN COUNT(DISTINCT comm) OVER(PARTITION BY empno)=COUNT(comm) OVER(PARTITION BY empno)
THEN comm
END comm
FROM x
GROUP BY empno, ename, job, mgr, sal, comm
HAVING COUNT(*)=1
ORDER BY 1, 2
Zahar,
Yes, I see your point. If both values are NULL, your WHEN condition is met and mine is not, but that doesn’t matter because both values are NULL…
Best regards,
Stew
Hi Stew,
I don’t know if this entry (https://docs.oracle.com/database/121/DWHSG/basicmv.htm#DWHSG8217) counts as official documentation for SYS_OP_MAP_NONNULL but if it does an easier to read variant becomes “supported”;
DECODE (
MAX (SYS_OP_MAP_NONNULL (n)) OVER (PARTITION BY test_id),
MIN (SYS_OP_MAP_NONNULL (n)) OVER (PARTITION BY test_id),
‘Same’,
‘Different’)
Hi Tony,
If SYS_OP_MAP_NONNULL were supported, we could just do
COUNT(DISTINCT SYS_OP_MAP_NONNULL(n)) OVER (PARTITION BY test_id)
which would be the most concise solution, though strangely the solution I posted is twice as fast in the few tests I did: it took less CPU.
Best regards,
Stew