Compare and sync tables: GROUP BY and MERGE

[Update 2014-11-29: I simplified the code a bit and changed the C## prefix to Z##. Oracle uses the C## prefix for “common users” in CDBs, and I wanted to avoid any confusion.]

I recently wrote about Tom Kyte’s GROUP BY technique for comparing tables. Now I’m going to show how to synchronize the tables with the MERGE statement.

(Here is a list of my posts about comparing and synchronizing tables.)

Synchronizing with MERGE works when the target table has a primary or unique key; otherwise there is no way to “match” rows for updating. The source data can be a table, a view or a query, as long as it doesn’t try to put duplicate rows in the target table.

For test data, I use two tables called T_SOURCE and T_TARGET, with a primary key on KEY_NUM and KEY_DATE and two other columns called VAL_TS (timestamp) and VAL_STRING. There are 30 rows in each table and 3 differences in all: one insert, one delete, one update. The details are here > Compare and sync tables: test data.

I’ll start by enhancing the comparison to provide better input to the MERGE.

Enhancing the comparison

The GROUP BY technique returns one row for INSERT and DELETE, but two rows for UPDATE. For the MERGE to work correctly, there should be one row for the update. Here’s the original comparison code and the result:

select KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING,
sum(OLD_CNT) OLD_CNT, sum(NEW_CNT) NEW_CNT
FROM (
  select KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING,
    1 OLD_CNT, 0 NEW_CNT
  from T_TARGET o
  union all
  select KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING,
    0 OLD_CNT, 1 NEW_CNT
  from T_SOURCE n
)
group by KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING
having sum(OLD_CNT) != sum(NEW_CNT);
KEY_NUM KEY_DATE VAL_TS VAL_STRING OLD_CNT NEW_CNT
1 04-FEB-14 04-FEB-14 11.52.13 AM INSERT – in source, not in target **************** 0 1
2 04-FEB-14 04-FEB-14 11.52.13 AM DELETE – not in source, in target **************** 1 0
3 04-FEB-14 04-FEB-14 11.52.13 AM UPDATE – update this from source ***************** 1 0
3 04-FEB-14 04-FEB-14 11.52.13 AM UPDATE – put this in target ********************** 0 1

Now come the changes. Starting with the innermost query, all the generated columns now have a “Z##” prefix to avoid collisions with the real table names. Also, I replace the OLD_CNT and NEW_CNT columns by one column: the “new” rows have 1 and the “old” rows -1.

select KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING,
 -1 Z##CNT
from T_TARGET o
union all
select KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING,
 1 Z##CNT
from T_SOURCE n;

When the GROUP BY is done, there are still two rows for every UPDATE. I need one row, with the data from the source. I also need to know which row requires which operation. To do this concisely, I use analytical functions on the result of the GROUP BY. This may look strange, but it is pretty powerful. (For simplicity, KEY_DATE and VAL_TS are left out of the results from now on.)

select KEY_NUM, /*KEY_DATE, VAL_TS,*/ VAL_STRING,
count(*) over(partition by KEY_NUM, KEY_DATE) show_count,
sum(Z##CNT) show_sum,
count(*) over(partition by KEY_NUM, KEY_DATE)
  - sum(Z##CNT) Z##IUD_FLAG
FROM (
  select KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING,
    -1 Z##CNT
  from T_TARGET o
  union all
  select KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING,
    1 Z##CNT
  from T_SOURCE n
)
group by KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING
having sum(Z##CNT) != 0
KEY_NUM VAL_STRING SHOW_COUNT SHOW_SUM Z##IUD_FLAG
1 INSERT – in source, not in target **************** 1 1 0
2 DELETE – not in source, in target **************** 1 -1 2
3 UPDATE – put this in target ********************** 2 1 1
3 UPDATE – update this from source ***************** 2 -1 3

The count of rows per primary key is either 1 (for insert / delete), or 2 (for update).
The “sum” of Z##CNT is either 1 (for a “new” row) or -1 (for an “old” row). By subtracting it from the count, I get a unique result for every situation:

  • 0 for INSERT
  • 1 for the “new” UPDATE row
  • 2 for DELETE
  • 3 for the “old” UPDATE row. This is the row I don’t need.

Now I just need to filter out the unneeded rows.

select * from (
  select KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING,
  count(*) over(partition by KEY_NUM, KEY_DATE)
    - sum(Z##CNT) Z##IUD_FLAG
  FROM (
    select KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING,
      -1 Z##CNT
    from T_TARGET o
    union all
    select KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING,
      1 Z##CNT
    from T_SOURCE n
  )
  group by KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING
  having sum(Z##CNT) != 0
)
where Z##IUD_FLAG < 3;
KEY_NUM VAL_STRING Z##IUD_FLAG
1 INSERT – in source, not in target **************** 0
2 DELETE – not in source, in target **************** 2
3 UPDATE – put this in target ********************** 1

Now I have all the necessary input to the MERGE statement:

merge into T_TARGET o
using (
  select * from (
    select KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING,
    count(*) over(partition by KEY_NUM, KEY_DATE)
      - sum(Z##CNT) Z##IUD_FLAG
    FROM (
      select KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING,
        -1 Z##CNT
      from T_TARGET o
      union all
      select KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING,
        1 Z##CNT
      from T_SOURCE n
    )
    group by KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING
    having sum(Z##CNT) != 0
  )
  where Z##IUD_FLAG < 3
) n
on (o.KEY_NUM = n.KEY_NUM and o.KEY_DATE = n.KEY_DATE)
when matched then update
  set VAL_TS = n.VAL_TS, VAL_STRING = n.VAL_STRING
  delete where n.Z##IUD_FLAG = 2
when not matched then insert
  (KEY_NUM, KEY_DATE, VAL_TS, VAL_STRING)
  values(n.KEY_NUM, n.KEY_DATE, n.VAL_TS, n.VAL_STRING);

3 rows merged.
If I run the same statement again, I get 0 rows merged. because the tables are now synchronized :)

Advertisements

2 thoughts on “Compare and sync tables: GROUP BY and MERGE

    • Hi Manjula,

      As for handling LOBs, the only thing I have done is to create a view on the source table that contains the first 4000 bytes of the LOB as a varchar2. The target only contains the varchar2, not the LOB. If you want more than that, it would be best to say in detail what you want to do, how you mean to compare and whether your database character set is single byte or multibyte (UTF8). Also, are you comparing tables in the same database or not? Pulling LOBs over a dblink to compare is going to take a long time!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s