My final followup of Jonathan Lewis’ Differences post looks at the
DBMS_COMPARISON package. I have my doubts about this package, which is why I added the question mark in the title.
For a pure SQL solution that I have no doubts about, see the list of my posts about comparing and synchronizing tables.
DBMS_COMPARISON was introduced in Oracle 11.1. There appears to be some confusion about whether it requires an extra licence. According to the Oracle team that sells to my company, only the database license is required. What costs extra is the Database Lifecycle Management Pack of Enterprise Manager, which has an interface to compare and converge both table data and table structures.
The package is well documented here > PL/SQL Packages and Types Reference: DBMS_COMPARISON
and here > Streams Replication Administrator’s Guide: Comparing and Converging Data. It is designed to “compare and converge” tables that are shared across databases using Advanced Replication or Streams-based replication. (The documentation uses the word “converge” when I use “sync”, but the intent is the same.)
Why not just use SQL?
GROUP BY method works great for comparing tables, but it does take lots of resources, such as memory and
TEMP space, if the tables have many rows or columns. What’s more, if one of the tables is remote then its data has to be transferred entirely to the local database for the comparison.
DBMS_COMPARISON uses less memory and reduces network traffic with two techniques:
- It uses a hash of the data instead of the data itself.
- It breaks the comparison down into smaller units of work, each of which uses little memory.
ORA_HASH: a problematic solution
Hashes have long been used to verify the integrity of files after transfer. The hash is calculated at the source, then recalculated at the target; if the result is the same, then we conclude that the file has not been corrupted during the transfer. If the result is different, then we know that the file has changed.
DBMS_COMPARISON uses this same technique to compare rows (or ranges of rows) and identify differences.
Despite the advantages of the
DBMS_COMPARISON architecture, the solution has a weak link in my opinion: the
ORA_HASH function. The rest of my post will concentrate on this point.
ORA_HASH function returns a “bucket value” between 0 and 4,294,967,295, or power(2,32)-1. In other words, it is a 32-bit hashing algorithm. Since the world’s data has more than 4,294,967,295 different values, sooner or later
ORA_HASH is going to return the same value for two different inputs! This is called a “hash collision”. If you use a hash to compare data and you get a hash collision, you will think the data are the same when they are in fact different!
What are the chances of a hash collision?
The chances of a hash collision depend on the number of bits in the hash (the more the better) and on how well the hash function spreads the results out across all the possible buckets. There are hash functions that produce 160, 256 or even 512 bits.
ORA_HASH is limited to 32 bits, so it should produce lots of collisions.
Here’s a little test to see how quickly we can produce a collision:
with data as ( select level n, ora_hash(level) ohash from dual connect by level <= 200000 ) select count(distinct ohash) number_of_collisions from data where ohash in ( select ohash from data group by ohash having count(*) > 1 );
We get a collision about every 25,000 rows.
This first test is unfair of course. Actually, the
DBMS_COMPARISON.COMPARE procedure compares not only hashes but primary keys. The chances of a hash collision with the same primary key are much less. To show this, I’ll borrow a bit of code from a SQL trace of the
with a as ( select level n from dual connect by level <= 2056 ) , b as ( select level n from dual connect by level <= 1005 ) , hash_data as ( select a.n an, b.n bn, ora_hash( nvl(to_char(a.n), 'ORA$STREAMS$NV'), 4294967295, ora_hash(nvl(to_char(b.n), 'ORA$STREAMS$NV'), 4294967295, 0) ) s_hash from a, b ) select * from hash_data where (an, s_hash) in ( select an, s_hash from hash_data group by an, s_hash having count(*) > 1 );
Here I got my first collision at row 2,066,280.
To double-check my findings, I’ll create tables with these “colliding” values and compare them using the package:
create table a (an primary key, bn) as select 2056, 848 from dual; create table b (an primary key, bn) as select 2056, 1005 from dual; BEGIN DBMS_COMPARISON.PURGE_COMPARISON( comparison_name => 'a_b' ); DBMS_COMPARISON.DROP_COMPARISON( comparison_name => 'a_b' ); END; / BEGIN DBMS_COMPARISON.CREATE_COMPARISON( comparison_name => 'a_b', schema_name => user, object_name => 'A', remote_object_name => 'B', dblink_name => null); END; / SET SERVEROUTPUT ON DECLARE consistent BOOLEAN; scan_info DBMS_COMPARISON.COMPARISON_TYPE; BEGIN consistent := DBMS_COMPARISON.COMPARE( comparison_name => 'a_b', scan_info => scan_info, perform_row_dif => TRUE ); DBMS_OUTPUT.PUT_LINE('Scan ID: '||scan_info.scan_id); IF consistent=TRUE THEN DBMS_OUTPUT.PUT_LINE('No differences were found.'); ELSE DBMS_OUTPUT.PUT_LINE('Differences were found.'); END IF; END; / anonymous block completed Scan ID: 4 No differences were found.
Conclusion: still a question mark
It may be OK to use
DBMS_COMPARISON as a low-cost “health check” for replicated tables, but if you want to fully synchronize two tables then I still recommend the
GROUP BY and
MERGE solution I presented in earlier posts.
DBMS_COMPARISON would be much improved if it used a hash function with more bits, but every hash function is subject to collisions. The ideal would be to use compressed data instead of a hash, since there would be no collisions – but that would be a completely different solution.