Chunking tables 2: Requirement

As I mentioned in my previous post, the basic idea is to divide a table into 12 “equal chunks”, each “chunk” being a rowid range that covers the same number of blocks (give or take 1). What does this mean? Why do this? Why divide by blocks and not rows? Here are my thoughts.

Current list of posts: 

Good old batch processing

Sometimes we have to do DML on many rows in a table, perhaps massive deletes or even more massive updates. Back in the old days these operations were called “batch jobs” and we spent half our time writing them. Guess what? Today in many shops we still spend half our time writing them! These processes can cause us problems such as:

  • Taking too long to finish;
  • Using lots of resources such as temp space, undo or redo;
  • Blocking concurrent online transactions.

This last problem is more common than ever, now that many OLTP systems need to be constantly available.

Parallelism

If our SQL is “taking too long” and we have unused resources, then we can take advantage of Oracle’s built-in parallelism. This is a terrific feature that works for both SELECTs and DML, that can be configured without changing any SQL, and that does all the parallel processing within one transaction.

Unfortunately built-in parallelism tends to increase resource consumption, not decrease it, and it interferes just as much with concurrent OLTP processing. Not only that, but it stinks when accessing remote databases because it uses only one DBLINK connection.

Do-it-yourself parallelism and intermediate commits

When built-in parallelism doesn’t help, we can help ourselves by splitting the processing up into chunks. Each chunk is processed separately as a separate transaction. We can process the chunks in parallel jobs or simply do one chunk at a time.

How can this chunk-based processing help us with our problems?

  • If we process chunks in parallel,
    • everything should finish faster
    • and when accessing a remote database, we can use more than one DBLINK connection at the same time.
  • If we process the chunks serially,
    • we use less temp and undo and free up resources when we commit at the end of each chunk
    • and by committing frequently we avoid blocking online transactions for long periods.

The major drawback of chunk-based processing comes from the intermediate commits. What happens if some exception occurs and we have only processed half the chunks? Changes to those chunks have been committed, so we mustn’t process them again. We need to be able to restart the job while processing only the remaining chunks.

Since version 11.2, the Oracle database provides the DBMS_PARALLEL_EXECUTE package to enable incremental DML (in parallel or not). This package keeps track of which chunks have committed and which have not, so that upon restart the committed chunks are not done again.

To Split into “equal chunks”: use rows or blocks?

Why are we splitting our table into chunks again? In order to process separate transactions that each take less time and use fewer resources.

What’s the use of that if one of the chunks contains almost the whole table?

We need to ensure that the chunks are equal in size so that each transaction takes about the same time and uses the same amount of resources.

So how do we define “equal”? At first glance, for exact equality each chunk should have the same number of rows. For our purposes, this is not always true. Oracle does some work at the row level and some work at the block level – including all access to disk and to the buffer cache. When it comes to time spent and resources used, blocks may be a better indicator than rows.

Here’s an extreme example: a table with large rows of 4000 bytes. With 100,000 rows, it takes about 3 seconds to count the number of rows. With 10 rows, it still takes just as long to count the number of rows!

SQL> create table t as
select rpad('x', 4000, 'x') x
from dual
connect by level <= 100000;

Table T created.

SQL> select count(*) from t;

 COUNT(*)
----------
 100000

Elapsed: 00:00:03.101

SQL> delete from t where rownum <= 100000 - 100;

99,900 rows deleted.

SQL> commit;

Commit complete.

SQL> select count(*) from t;

 COUNT(*)
----------
 100

Elapsed: 00:00:04.026

Did you see what I did there? I created a table with 100,000 rows and one row per block, so 100,000 blocks (plus a few blocks of overhead). When I deleted most of the rows, that emptied out most of the blocks, but Oracle still had to access those blocks to find out whether they were empty or not!

The point is that “equality” (in time and resources) depends on the number of blocks too, not just on the number of rows, especially when doing a full table scan – which is usually the best way to do DML on an entire table.

Finally, to split a table by row count can be a costly operation, whereas splitting by block count can be done without ever accessing the table. We’ll see how in the next post.

Fixed chunk size or fixed chunk number?

The DBMS_PARALLEL_EXECUTE package proposes to create chunks with a fixed size, such as 64 blocks, and not a fixed number. Each process just does chunk after chunk until all are done. I think this is probably the most common use case. Requiring a fixed number of chunks, as Bryn has done, may be helpful to access a remote database through multiple DBLINKs.

Conclusion

  • It may be useful to do DML on a table by breaking it up into chunks of equal size, each of which is processed in a separate transaction.
  • “Equal size” can be measured in rows or blocks, but block-based chunks can be created efficiently without accessing the table.
  • DBMS_PARALLEL_EXECUTE creates chunks of a fixed size, but here we want to create a fixed number of chunks.

Now we need a way to create a fixed number of chunks (with equal numbers of blocks), and a way to access the data in each chunk efficiently. Maybe the next post will help us do that…

Current list of posts:

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s