Compare and sync tables: DBMS_COMPARISON?

My final followup of Jonathan Lewis’ Differences post looks at the DBMS_COMPARISON package. I have my doubts about this package, which is why I added the question mark in the title.

For a pure SQL solution that I have no doubts about, see this series:

DBMS_COMPARISON was introduced in Oracle 11.1. There appears to be some confusion about whether it requires an extra licence. According to the Oracle team that sells to my company, only the database license is required. What costs extra is the Database Lifecycle Management Pack of Enterprise Manager, which has an interface to compare and converge both table data and table structures.

The package is well documented here > PL/SQL Packages and Types Reference: DBMS_COMPARISON
and here > Streams Replication Administrator’s Guide: Comparing and Converging Data. It is designed to “compare and converge” tables that are shared across databases using Advanced Replication or Streams-based replication. (The documentation uses the word “converge” when I use “sync”, but the intent is the same.)

Why not just use SQL?

Tom Kytes’s GROUP BY method works great for comparing tables, but it does take lots of resources, such as memory and TEMP space, if the tables have many rows or columns. What’s more, if one of the tables is remote then its data has to be transferred entirely to the local database for the comparison.

DBMS_COMPARISON uses less memory and reduces network traffic with two techniques:

  1. It uses a hash of the data instead of the data itself.
  2. It breaks the comparison down into smaller units of work, each of which uses little memory.

ORA_HASH: a problematic solution

Hashes have long been used to verify the integrity of files after transfer. The hash is calculated at the source, then recalculated at the target; if the result is the same, then we conclude that the file has not been corrupted during the transfer. If the result is different, then we know that the file has changed. DBMS_COMPARISON uses this same technique to compare rows (or ranges of rows) and identify differences.

Despite the advantages of the DBMS_COMPARISON architecture, the solution has a weak link in my opinion: the ORA_HASH function. The rest of my post will concentrate on this point.

The ORA_HASH function returns a “bucket value” between 0 and 4,294,967,295, or power(2,32)-1. In other words, it is a 32-bit hashing algorithm. Since the world’s data has more than 4,294,967,295 different values, sooner or later ORA_HASH is going to return the same value for two different inputs! This is called a “hash collision”. If you use a hash to compare data and you get a hash collision, you will think the data are the same when they are in fact different!

What are the chances of a hash collision?

The chances of a hash collision depend on the number of bits in the hash (the more the better) and on how well the hash function spreads the results out across all the possible buckets. There are hash functions that produce 160, 256 or even 512 bits. ORA_HASH is limited to 32 bits, so it should produce lots of collisions.

Here’s a little test to see how quickly we can produce a collision:

with data as (
  select level n, ora_hash(level) ohash
  from dual
  connect by level <= 200000
)
select count(distinct ohash) number_of_collisions
from data
where ohash in (
  select ohash from data
  group by ohash
  having count(*) > 1
);
NUMBER_OF_COLLISIONS
8

We get a collision about every 25,000 rows.

This first test is unfair of course. Actually, the DBMS_COMPARISON.COMPARE procedure compares not only hashes but primary keys. The chances of a hash collision with the same primary key are much less. To show this, I’ll borrow a bit of code from a SQL trace of the COMPARE procedure:

with a as (
  select level n from dual
  connect by level <= 2056
)
, b as (
  select level n from dual
  connect by level <= 1005
)
, hash_data as (
  select a.n an, b.n bn,
  ora_hash(nvl(to_char(a.n), 'ORA$STREAMS$NV'), 4294967295, ora_hash(nvl(
  to_char(b.n), 'ORA$STREAMS$NV'), 4294967295, 0)) s_hash
  from a, b
)
select * from hash_data
where (an, s_hash) in (
  select an, s_hash from hash_data
  group by an, s_hash
  having count(*) > 1
);
AN BN S_HASH
2056 848 2074509648
2056 1005 2074509648

Here I got my first collision at row 2,066,280.

To double-check my findings, I’ll create tables with these “colliding” values and compare them using the package:

create table a (an primary key, bn)
as select 2056, 848 from dual;

create table b (an primary key, bn)
as select 2056, 1005 from dual;

BEGIN
  DBMS_COMPARISON.PURGE_COMPARISON(
    comparison_name => 'a_b'
  );
  DBMS_COMPARISON.DROP_COMPARISON(
    comparison_name => 'a_b'
  );
END;
/
BEGIN
  DBMS_COMPARISON.CREATE_COMPARISON(
    comparison_name    => 'a_b',
    schema_name        => user,
    object_name        => 'A',
    remote_object_name => 'B',
    dblink_name        => null);
END;
/

SET SERVEROUTPUT ON
DECLARE
  consistent   BOOLEAN;
  scan_info    DBMS_COMPARISON.COMPARISON_TYPE;
BEGIN
  consistent := DBMS_COMPARISON.COMPARE(
    comparison_name => 'a_b',
    scan_info       => scan_info,
    perform_row_dif => TRUE
  );
  DBMS_OUTPUT.PUT_LINE('Scan ID: '||scan_info.scan_id);
  IF consistent=TRUE THEN
    DBMS_OUTPUT.PUT_LINE('No differences were found.');
  ELSE
    DBMS_OUTPUT.PUT_LINE('Differences were found.');
  END IF;
END;
/

anonymous block completed
Scan ID: 4
No differences were found.

Conclusion: still a question mark

It may be OK to use DBMS_COMPARISON as a low-cost “health check” for replicated tables, but if you want to fully synchronize two tables then I still recommend the GROUP BY and MERGE solution I presented in earlier posts. DBMS_COMPARISON would be much improved if it used a hash function with more bits, but every hash function is subject to collisions. The ideal would be to use compressed data instead of a hash, since there would be no collisions – but that would be a completely different solution.

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s