Techniques for Comparing Tables

In my “Advanced Row Pattern Matching” presentation, I demonstrate using MATCH_RECOGNIZE to compare tables. Kim Berg Hansen asked me to compare this technique with others. I did some quick tests and here are the results with some comments.

Technique Seconds
Full join 1
Group by (HASH) 1
Group by (SORT) 1.4
Analytic function 2.5
MATCH_RECOGNIZE 3.7

 

The “Full join” technique only works when we have a primary or unique key that is shared by both tables. I prefer the GROUP BY technique popularized by Tom Kyte, even though it may be a bit slower. When testing, I noticed that the HASH GROUP BY algorithm performs better than SORT GROUP BY, as others have written.

If either of the tables contains duplicate rows (which may happen if we don’t compare all of the columns, or if there is no primary key), then GROUP BY will output one row. This may be a problem if we want data (such as the ROWID)  that was not included in the comparison. In that case, we could use analytic functions or the MATCH_RECOGNIZE clause to compare and output all the rows and columns of interest. As you can see, the analytic function is more than twice as slow but it easily beats the MATCH_RECOGNIZE clause.

I use the output from table comparisons to synchronize the tables, so capturing the ROWID is important to me even when a primary or unique key is not available. For that use case, I will prefer analytic functions from now on.

Advertisements

3 thoughts on “Techniques for Comparing Tables

  1. Thanks Stew.

    How many records did you test the performance on?

    What technique (from the list in the table above) do you use in COMPARE_SYNC package? — GROUP BY (HASH) or GROUP BY (SORT)?

    I use your COMPARE_SYNC package heavily for all of my ETLs and it is a life-saver!

    Based on your findings above, will this warrant a code-change in COMPARE_SYNC package? I want to make sure that I get the latest version of your performance-optimized code.

    Thanks a million :)

    • Hi Sam,

      I tested on about 500000 rows.

      The HASH GROUP BY or SORT GROUP BY is a decision by the optimizer. If you want to check, do an EXPLAIN PLAN on the SQL you generate from my package – or better find the SQL_ID after execution and use DBMS_XPLAN.DISPLAY_CURSOR to get the execution plan that was actually used the last time.

      I will be working on a revised version of COMPARE_SYNC in the near future. I want to make the API a tiny bit simpler to use and I want to add a new feature or two. If I can make it perform better I will.

      Thanks for your kind words and best regards, Stew

      • Awesome, thanks!

        If you do get around enhancing the package code, please ensure that it is compatible with 11gR2 version as there are majority of us who are not on 12c just yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s