Compare and sync tables: DBMS_COMPARISON?

My final follow-up to Jonathan Lewis’ Differences post looks at the DBMS_COMPARISON package. I have my doubts about this package, which is why I added the question mark in the title.

For a pure SQL solution that I have no doubts about, see the list of my posts about comparing and synchronizing tables.

DBMS_COMPARISON was introduced in Oracle 11.1. There appears to be some confusion about whether it requires an extra licence. According to the Oracle team that sells to my company, only the database license is required. What costs extra is the Database Lifecycle Management Pack of Enterprise Manager, which has an interface to compare and converge both table data and table structures.

The package is well documented here > PL/SQL Packages and Types Reference: DBMS_COMPARISON
and here > Streams Replication Administrator’s Guide: Comparing and Converging Data. It is designed to “compare and converge” tables that are shared across databases using Advanced Replication or Streams-based replication. (The documentation uses the word “converge” when I use “sync”, but the intent is the same.)

Why not just use SQL?

Tom Kyte’s GROUP BY method works great for comparing tables, but it takes a lot of resources, such as memory and TEMP space, if the tables have many rows or columns. What’s more, if one of the tables is remote, its data has to be transferred entirely to the local database for the comparison.
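To make the idea concrete outside the database, here is a minimal Python sketch of the same GROUP BY logic (my own illustration, not Tom Kyte’s actual SQL): group on the whole row, count occurrences on each side, and keep the rows whose counts differ.

```python
from collections import Counter

def compare_tables(old_rows, new_rows):
    """Group on the whole row and keep rows whose occurrence counts differ.

    Each row is a tuple of column values; the result maps a row to its
    (count in old, count in new) pair, mimicking the SQL GROUP BY ... HAVING
    comparison.
    """
    old_cnt = Counter(old_rows)
    new_cnt = Counter(new_rows)
    return {row: (old_cnt[row], new_cnt[row])
            for row in old_cnt.keys() | new_cnt.keys()
            if old_cnt[row] != new_cnt[row]}

old = [(1, 'a'), (2, 'b'), (3, 'c')]
new = [(1, 'a'), (2, 'B'), (3, 'c'), (4, 'd')]
print(compare_tables(old, new))
```

Like the SQL version, this holds both row sets in memory at once, which is exactly the resource cost described above.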

DBMS_COMPARISON uses less memory and reduces network traffic with two techniques:

  1. It uses a hash of the data instead of the data itself.
  2. It breaks the comparison down into smaller units of work, each of which uses little memory.
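These two techniques combine naturally: compare a hash per bucket of rows, and drill down only into buckets whose hashes differ. The sketch below is my own hedged illustration of that divide-and-compare idea, not DBMS_COMPARISON’s actual algorithm (the function names and the SHA-256 choice are mine):

```python
import hashlib

def bucket_hash(table, lo, hi):
    """Hash all rows whose key falls in [lo, hi). `table` maps key -> row value."""
    h = hashlib.sha256()
    for k in sorted(table):
        if lo <= k < hi:
            h.update(f"{k}:{table[k]};".encode())
    return h.digest()

def find_diffs(src, dst, lo, hi, leaf_size=4):
    """Recursively narrow down differing key ranges by comparing bucket hashes."""
    if bucket_hash(src, lo, hi) == bucket_hash(dst, lo, hi):
        return []             # whole bucket matches: nothing more to transfer
    if hi - lo <= leaf_size:  # small enough: report row-level differences
        return [k for k in range(lo, hi) if src.get(k) != dst.get(k)]
    mid = (lo + hi) // 2
    return (find_diffs(src, dst, lo, mid, leaf_size) +
            find_diffs(src, dst, mid, hi, leaf_size))

src = {k: 'v%d' % k for k in range(16)}
dst = dict(src)
dst[5] = 'changed'
print(find_diffs(src, dst, 0, 16))  # -> [5]
```

Only the hashes of mismatching buckets trigger further work, so matching data never crosses the network.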

ORA_HASH: a problematic solution

Hashes have long been used to verify the integrity of files after transfer. The hash is calculated at the source, then recalculated at the target; if the result is the same, then we conclude that the file has not been corrupted during the transfer. If the result is different, then we know that the file has changed. DBMS_COMPARISON uses this same technique to compare rows (or ranges of rows) and identify differences.
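For illustration, the file-transfer version of this technique looks like the following Python sketch (hashlib and SHA-256 are my choices for the example; ORA_HASH itself is specific to Oracle):

```python
import hashlib

def file_digest(path, algo='sha256'):
    """Stream the file through a hash so even large files use little memory."""
    h = hashlib.new(algo)
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    return h.hexdigest()

# Same principle as the source/target check: recompute at the target and compare.
# if file_digest('export.dmp') == digest_reported_by_source:
#     print('transfer looks intact')
```

Only the short digest travels between the two sides, never the file itself.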

Despite the advantages of the DBMS_COMPARISON architecture, the solution has a weak link in my opinion: the ORA_HASH function. The rest of my post will concentrate on this point.

The ORA_HASH function returns a “bucket value” between 0 and 4,294,967,295, or power(2,32)-1. In other words, it is a 32-bit hashing algorithm. Since the world’s data has more than 4,294,967,295 different values, sooner or later ORA_HASH is going to return the same value for two different inputs! This is called a “hash collision”. If you use a hash to compare data and you get a hash collision, you will think the data are the same when they are in fact different!

What are the chances of a hash collision?

The chances of a hash collision depend on the number of bits in the hash (the more the better) and on how well the hash function spreads the results out across all the possible buckets. There are hash functions that produce 160, 256 or even 512 bits. ORA_HASH is limited to 32 bits, so it should produce lots of collisions.
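The birthday-paradox approximation makes this concrete. Here is a back-of-the-envelope sketch (my own arithmetic, using the standard 1 − exp(−n(n−1)/2B) approximation for B buckets):

```python
import math

def p_collision(n, bits):
    """Birthday-paradox approximation: probability that at least two of n
    uniformly distributed hash values of the given bit width collide."""
    buckets = 2.0 ** bits
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * buckets))

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} rows, 32-bit hash: P(collision) ~ {p_collision(n, 32):.4f}")
print(f"  100,000 rows, 160-bit hash: P(collision) ~ {p_collision(100_000, 160):.2e}")
```

With a 32-bit hash, 100,000 rows already give well over a 50% chance of at least one collision, whereas at 160 bits the probability is negligible for any realistic table size.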

Here’s a little test to see how quickly we can produce a collision:

with data as (
  select level n, ora_hash(level) ohash
  from dual
  connect by level <= 200000
)
select count(distinct ohash) number_of_collisions
from data
where ohash in (
  select ohash from data
  group by ohash
  having count(*) > 1
);
We get a collision about every 25,000 rows.
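ORA_HASH is not available outside Oracle, but any 32-bit hash behaves the same way. As a hedged cross-check, here is a Python analogue that uses the first 4 bytes of an MD5 digest as a stand-in 32-bit hash (my substitution, not ORA_HASH’s algorithm):

```python
import hashlib
from collections import Counter

def hash32(value):
    """Stand-in for a 32-bit hash: the first 4 bytes of an MD5 digest,
    giving 2**32 possible buckets, like ORA_HASH's default range."""
    return int.from_bytes(hashlib.md5(str(value).encode()).digest()[:4], 'big')

N = 1_000_000
counts = Counter(hash32(i) for i in range(N))
colliding_buckets = sum(1 for c in counts.values() if c > 1)
print(f"{N} distinct inputs -> {colliding_buckets} hash values hit more than once")
```

Even with a well-distributed hash, a million distinct inputs reliably produce collisions, as the birthday-paradox arithmetic predicts.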

This first test is unfair, of course. The DBMS_COMPARISON.COMPARE procedure actually compares primary keys as well as hashes, and the chances of a hash collision combined with the same primary key are much smaller. To show this, I’ll borrow a bit of code from a SQL trace of the COMPARE procedure:

with a as (
  select level n from dual
  connect by level <= 2056
)
, b as (
  select level n from dual
  connect by level <= 1005
)
, hash_data as (
  select a.n an, b.n bn,
    ora_hash(
      nvl(to_char(a.n), 'ORA$STREAMS$NV'),
      4294967295,
      ora_hash(nvl(to_char(b.n), 'ORA$STREAMS$NV'), 4294967295, 0)
    ) s_hash
  from a, b
)
select * from hash_data
where (an, s_hash) in (
  select an, s_hash from hash_data
  group by an, s_hash
  having count(*) > 1
);

        AN         BN     S_HASH
---------- ---------- ----------
      2056        848 2074509648
      2056       1005 2074509648

Here I got my first collision at row 2,066,280.

To double-check my findings, I’ll create tables with these “colliding” values and compare them using the package:

create table a (an primary key, bn)
as select 2056, 848 from dual;

create table b (an primary key, bn)
as select 2056, 1005 from dual;

begin
  dbms_comparison.create_comparison(
    comparison_name    => 'a_b',
    schema_name        => user,
    object_name        => 'A',
    remote_object_name => 'B',
    dblink_name        => null);
end;
/

declare
  scan_info    DBMS_COMPARISON.COMPARISON_TYPE;
  consistent   BOOLEAN;
begin
  consistent := DBMS_COMPARISON.COMPARE(
    comparison_name => 'a_b',
    scan_info       => scan_info,
    perform_row_dif => TRUE);
  DBMS_OUTPUT.PUT_LINE('Scan ID: '||scan_info.scan_id);
  IF consistent=TRUE THEN
    DBMS_OUTPUT.PUT_LINE('No differences were found.');
  ELSE
    DBMS_OUTPUT.PUT_LINE('Differences were found.');
  END IF;
end;
/

anonymous block completed
Scan ID: 4
No differences were found.

Conclusion: still a question mark

It may be OK to use DBMS_COMPARISON as a low-cost “health check” for replicated tables, but if you want to fully synchronize two tables then I still recommend the GROUP BY and MERGE solution I presented in earlier posts. DBMS_COMPARISON would be much improved if it used a hash function with more bits, but every hash function is subject to collisions. The ideal would be to use compressed data instead of a hash, since there would be no collisions – but that would be a completely different solution.

7 thoughts on “Compare and sync tables: DBMS_COMPARISON?”

  1. Hi Steve, When you run COMPARE_SYNC.COMPARE_SQL, how do we know whether the differences that exist are due to in-flight transactions or are true differences?

    • Hello Vishaldesai,

      When you run one of the COMPARE_SYNC functions, it returns a SQL statement that you then execute.

      Whenever you run a SQL statement in Oracle, the database never shows you changes from in-flight transactions, except for your own transaction. In other words, the only uncommitted changes you can see are the ones you have made yourself.

      This is true of any SQL statement, not just the ones my package generates.

  2. Hi Steve,

    I’m trying to use this package to generate SQL statements for all the tables we are replicating with GoldenGate from source to target, run those statements, and show the differences between source and target on a regular basis.

    The following SQL will list the primary keys that should be inserted, removed or updated on the target side to match the source. Since some differences could be due to in-flight records, a first check inserts table_name, primary key and a flag into a log table; then, after some time, a recheck compares source and target only for the primary key values that reported differences (something similar to DBMS_COMPARISON’s recheck).

    I=Insert, M=match, R=remove and D=Different
    SELECT table_name, table_pk,
           CASE
             WHEN s_old = 0 AND s_new = 1 AND cnt = 1 THEN 'I'
             WHEN s_old = 1 AND s_new = 1 AND cnt = 1 THEN 'M'
             WHEN s_old = 1 AND s_new = 0 AND cnt = 1 THEN 'R'
             WHEN s_old = 1 AND s_new = 1 AND cnt = 2 THEN 'D'
           END AS flag,
           COUNT (1) AS cont
    FROM ( SELECT table_pk,
                  SUM (old_cnt) s_old,
                  SUM (new_cnt) s_new,
                  COUNT (1) cnt
           FROM (***original query returned by package)
           GROUP BY table_pk
           HAVING NOT (SUM (old_cnt) = 1 AND SUM (new_cnt) = 1 AND COUNT (1) = 1) )
    GROUP BY CASE
               WHEN s_old = 0 AND s_new = 1 AND cnt = 1 THEN 'I'
               WHEN s_old = 1 AND s_new = 1 AND cnt = 1 THEN 'M'
               WHEN s_old = 1 AND s_new = 0 AND cnt = 1 THEN 'R'
               WHEN s_old = 1 AND s_new = 1 AND cnt = 2 THEN 'D'
             END, table_pk;

  3. Hi Steve, I am trying to use this package to get the difference in data between the SAME table present in two different databases in Oracle. I have a couple of questions. 1. Where can we get the actual column which has the difference? I couldn’t find this in any of the resultant views.

  4. Don’t know whether you still follow this but I have a couple of questions if you have thoughts:
    – If I want to identify schema changes within a database since a given point in time, rather than compare two existing databases, how could I determine that? i.e. new/dropped tables, columns, etc.?
    – if I wanted to identify any changes in privileges since a given point in time, how could I approach that?

    any pointers appreciated!

    • Hi Antonio,

      I have never attacked that problem in my work, so I may have thoughts but not experience. I think that observing historical changes requires storing those changes, either through source control outside the database or maybe using AUDIT. If you have not stored the changes as they happened, I suspect it is too late.

      Best regards, Stew
