Compare tables: Tom Kyte and GROUP BY

This is the second in a series of posts on comparing data and synchronizing tables. I announced that I would tackle dbms_comparison next, but I decided to write about a pure SQL solution first. It will be easier to talk about dbms_comparison if I have something to compare it with. Besides, I wanted to name-drop one more time ;)

(Here is a list of my posts about comparing and synchronizing tables.)

Tom Kyte’s solution for comparing tables

Tom explained this method here > Oracle Magazine: On Injecting and Comparing. Scroll down to the title “Comparing the Contents of Two Tables”. The link to the original AskTom discussion no longer works, but this one does > Marco and Tom’s discussion.

I’ll demonstrate the method using the same test data as my previous post, but with fewer rows. To repeat: I use two tables called T_SOURCE and T_TARGET, with a primary key on KEY_NUM and KEY_DATE and two other columns called VAL_TS (timestamp) and VAL_STRING. This time there are 30 rows in each table and 3 differences in all. The details are here > Compare and sync tables: test data

Here is a summary of the data from both tables:

TABLE_NAME CNT VAL_STRING
t_target 1 DELETE – not in source, in target ****************
t_source 1 INSERT – in source, not in target ****************
t_source 1 UPDATE – put this in target **********************
t_target 1 UPDATE – update this from source *****************
t_source 28 ZERO – same in source and target *****************
t_target 28 ZERO – same in source and target *****************

Each table has one row that is not in the other, and they both have one row with the same KEY_NUM but different VAL_STRINGs.

The GROUP BY solution, step by step

The ultimate purpose here is to make T_TARGET identical to T_SOURCE, so I’m going to call T_TARGET the “old” table and T_SOURCE the “new” one. I’ll use the alias “o” for old and “n” for new.
First I do a full scan of each table, selecting all the non-virtual columns. I add two columns: OLD_CNT is 1 when the row belongs to T_TARGET, and NEW_CNT is 1 when the row comes from T_SOURCE.

select KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING,
  1 OLD_CNT, 0 NEW_CNT
from T_TARGET o
union all
select KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING,
  0 OLD_CNT, 1 NEW_CNT
from T_SOURCE n;

Now comes the comparison magic. Comparing values in Oracle is not simple if one or both can be NULL, since a NULL value is neither equal to nor different from anything else. Fortunately, GROUP BY considers NULL values to be equivalent, so I just GROUP BY all the column values and SUM the OLD_CNT and NEW_CNT numbers.
Afterwards, if OLD_CNT = NEW_CNT then the rows are identical.
Another neat thing about GROUP BY is that the HAVING clause filters the result immediately, so I can throw out the identical rows as I go.

select KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING,
sum(OLD_CNT) OLD_CNT, sum(NEW_CNT) NEW_CNT
FROM (
  select KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING,
    1 OLD_CNT, 0 NEW_CNT
  from T_TARGET o
  union all
  select KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING,
    0 OLD_CNT, 1 NEW_CNT
  from T_SOURCE n
)
group by KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING
having sum(OLD_CNT) != sum(NEW_CNT);
KEY_NUM KEY_DATE VAL_TS VAL_STRING OLD_CNT NEW_CNT
1 04-FEB-14 04-FEB-14 11.52.13 AM INSERT – in source, not in target **************** 0 1
2 04-FEB-14 04-FEB-14 11.52.13 AM DELETE – not in source, in target **************** 1 0
3 04-FEB-14 04-FEB-14 11.52.13 AM UPDATE – update this from source ***************** 1 0
3 04-FEB-14 04-FEB-14 11.52.13 AM UPDATE – put this in target ********************** 0 1

Those are all the steps: we’re done already!

  • The first row is in the “new” table but not in the old.
  • The second row is in the “old” table but not in the new.
  • The row where KEY_NUM = 3 is in both tables, but with different values.

Remember I said these tables have primary keys? What happens when you compare tables without primary keys? In that case there may but duplicates, more than one row with all the same values. This is not a problem: if there are duplicates, OLD_CNT and / or NEW_CNT will be greater than 1. If the counts are equal, all is well; if they are not, the result will show how many of each row exist in each table.

GROUP BY advantages

To my mind, this solution is efficient because it scans each table only once. It is elegant because GROUP BY handles NULL and NOT NULL values implicitly, and because the HAVING clause filters immediately. You can see the differences easily because the compared values are right under each other in the same column. Finally, if there are duplicate rows you can see how many of each row were found in each table.

Update: another advantage of this technique is that you can compare query results, not just tables. All you have to do is replace one or both tables by “inline views”, that is, SELECT statements within parentheses. I do this when rewriting queries for performance, as a quick check that the results are the same.

The only drawback is that you have to write the SQL statement to do the comparison. In my next post, I’ll provide a bit of PL/SQL that will generate the SQL for you.

Advertisements

3 thoughts on “Compare tables: Tom Kyte and GROUP BY

  1. Thank so much for writing this out so clearly. That group by query is the CDC capture magic I’ve been searching for…. where have you been all my life?! :)

  2. Pingback: Real Time Materialized Views (Part 2) | CTAndrewSayer

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s