Chunking tables 4: Histograms

After “only” 3 posts on chunking tables, we now have the raw materials for breaking a table up into 12 chunks with (almost) equal numbers of blocks, and for getting ROWID ranges that let us access the data. Now let’s chunk!

Current list of posts:

What needs to be done?

The basic idea is pretty simple:

  1. Treat all the table blocks as if they were one contiguous set
  2. Number all the blocks in ascending sequence
  3. Filter out the extents that do not contain a chunk boundary
    (yes, I do this before the next step!)
  4. Calculate the chunk boundaries
  5. Map each chunk boundary to the corresponding block in the right extent
  6. Calculate each ROWID
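Before diving into the SQL, the numeric heart of the plan can be sketched in a few lines of Python over a hypothetical extent list (the sizes happen to match the example table created below; the ceil-based boundary formula is just one way to express equal closed-open intervals, not necessarily the method a later post will use):

```python
import math

# Steps 1-2: treat all blocks as one contiguous set and number them from 0.
# Hypothetical extent sizes, already in ROWID order.
extent_sizes = [8] * 16 + [128]
total_blocks = sum(extent_sizes)          # 256

# Step 4 (preview): 12 equal closed-open chunks over [0, total_blocks)
num_chunks = 12
boundaries = [math.ceil(c * total_blocks / num_chunks)
              for c in range(num_chunks + 1)]
print(total_blocks)       # 256
print(boundaries[:5])     # [0, 22, 43, 64, 86]
```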

One contiguous set

Blocks within an extent are contiguous, at least as seen by us. Blocks in different extents may not be contiguous; they may even live in different files or data objects. We don’t care about any of that as long as our ROWID ranges work. For that, we just need to order the “raw material” by data_object_id, file_id and block_id. That way all the extents (and the blocks in the extents) will be ordered by ascending ROWID value.

Number the blocks in ascending sequence

For this we can use the analytic SUM() function to get the running sum of blocks as we go through the extents in the correct order.

One more thing: to calculate the chunk boundaries, we are going to need the total number of blocks in the table. A different analytic SUM() function will provide that total.

To better demonstrate, I’m going to create a bigger table.

create table t as
select rpad('x', 4000, 'x') x
from dual
connect by level <= 130;

with extents_data as (
  select /*+ qb_name(extents_data) */
    o.data_object_id, e.file_id, e.block_id, e.blocks
  from dba_extents e
  join all_objects o
  on e.owner = o.owner
    and e.segment_name = o.object_name
    and e.segment_type = o.object_type
    and decode(e.partition_name, o.subobject_name, 0, 1) = 0
  where e.segment_type like 'TABLE%'
    and e.owner = user
    and e.segment_name = 'T'
)
, extents_with_sums as (   
  select /*+ qb_name(extents_with_sums) */
    sum(blocks) over() as total_blocks, 
    sum(blocks) over(
      order by data_object_id, file_id, block_id
    ) - blocks as first_extent_block,   
    sum(blocks) over(
      order by data_object_id, file_id, block_id
    ) as next_extent_block, 
    e.*
  from extents_data e 
) 
select * from extents_with_sums;
TOTAL_BLOCKS FIRST_EXTENT_BLOCK NEXT_EXTENT_BLOCK BLOCK_ID BLOCKS
------------ ------------------ ----------------- -------- ------
         256                  0                 8      128      8
         256                  8                16      136      8
         256                 16                24      144      8
         256                 24                32      152      8
         256                 32                40      160      8
         256                 40                48      168      8
         256                 48                56      176      8
         256                 56                64      184      8
         256                 64                72      192      8
         256                 72                80      200      8
         256                 80                88      208      8
         256                 88                96      216      8
         256                 96               104      224      8
         256                104               112      232      8
         256                112               120      240      8
         256                120               128      248      8
         256                128               256      256    128

I removed two columns from the output to make it fit better.

There are a lot of things to notice here, so bear with me:

  • I break the entire SQL query into a series of subqueries using WITH clauses.
  • Each WITH clause gets a “query block name” (qb_name) equal to the subquery name. I hope this will make it easier to analyze the execution plan.
  • In the numbering scheme, I chose to start with 0 and not with 1 because the math is a bit easier.
  • SUM(blocks) OVER() gives me the total number of blocks on every line.
  • Since I start with block 0, the running SUM() of blocks actually gives me the block number of the beginning of the next extent. To get the beginning of the current extent, I just subtract its number of blocks.
    (I could use a fancy “windowing clause” in the analytic function, but this seemed easier to explain.)
  • Since I am numbering every block consecutively, I can calculate the last block of any extent: it is one less than the first block of the following extent.
  • Since the tablespace allocates extents automatically (AUTOALLOCATE), the extent size jumped from 8 blocks to 128 blocks for the last extent.
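The two analytic SUM()s can be checked with a few lines of Python (the extent sizes are copied from the output above; the tuple layout just mirrors the SQL column order for illustration):

```python
# Extent sizes in ROWID order, as in the output above: 16 x 8 blocks, then 1 x 128
blocks_list = [8] * 16 + [128]

total_blocks = sum(blocks_list)             # SUM(blocks) OVER ()
rows, running = [], 0
for blocks in blocks_list:
    first_extent_block = running            # running SUM minus current blocks
    running += blocks
    next_extent_block = running             # plain running SUM
    rows.append((total_blocks, first_extent_block, next_extent_block, blocks))

print(rows[0])    # (256, 0, 8, 8)
print(rows[-1])   # (256, 128, 256, 128)
```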

Did I say histograms?

An equal-width histogram is a range divided into intervals of identical size. Isn’t that exactly what we want? The range goes from 0 (the first block) up to but not including TOTAL_BLOCKS, and the number of intervals is the number of chunks, which is 12 in our example. To make sure no block lands in two chunks, each interval includes its lower boundary and excludes its upper boundary. This is called a “closed-open” interval.

There is a function that does exactly what I just described:
WIDTH_BUCKET(expression, included_min_value, excluded_max_value, number_of_intervals)
or in this case
WIDTH_BUCKET(block, 0, 256, 12).
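For integer block numbers in this range, WIDTH_BUCKET’s behavior can be emulated with simple integer arithmetic (a sketch of the observable behavior, not Oracle’s implementation):

```python
def width_bucket(n, min_val, max_val, buckets):
    """Closed-open equal-width buckets 1..buckets; 0 below min, buckets+1 at/above max."""
    if n < min_val:
        return 0
    if n >= max_val:
        return buckets + 1
    return (n - min_val) * buckets // (max_val - min_val) + 1

print(width_bucket(-1, 0, 256, 12))   # 0  ("illegal" low bucket)
print(width_bucket(0, 0, 256, 12))    # 1
print(width_bucket(21, 0, 256, 12))   # 1  (last block of chunk 1)
print(width_bucket(22, 0, 256, 12))   # 2  (first block of chunk 2)
print(width_bucket(256, 0, 256, 12))  # 13 ("illegal" high bucket)
```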

Let’s take every block from 0 to 255 and find out what chunk it is in.

select chunk, min(n) first_block, max(n) last_block,
       max(n) + 1 - min(n) num_blocks
from (
  select level-2 n,
         width_bucket(level-2,0,256,12) chunk
  from dual
  connect by level <= 258
)
group by chunk
order by chunk;
CHUNK FIRST_BLOCK LAST_BLOCK NUM_BLOCKS
----- ----------- ---------- ----------
    0          -1         -1          1
    1           0         21         22
    2          22         42         21
    3          43         63         21
    4          64         85         22
    5          86        106         21
    6         107        127         21
    7         128        149         22
    8         150        170         21
    9         171        191         21
   10         192        213         22
   11         214        234         21
   12         235        255         21
   13         256        256          1

I deliberately went outside the boundaries to show that they get put into “illegal” chunks: 0 and 13.

Notice that the number of blocks in each chunk is as equal as possible. This is looking pretty good.

There is one problem though: WIDTH_BUCKET answers the question “what chunk am I in?” but it doesn’t answer the question “what are the lower and upper boundaries of each chunk?” We’ll see how to do that later.

Filter unnecessary extents

This is a little performance boost that Jonathan Lewis praised. A very large table, or a table with lots of partitions, will have many rows in DBA_EXTENTS, but we are only interested in the rows that contain a chunk boundary. In our example, that means 24 rows at the most.

When does an extent not contain a chunk boundary? When the end of the previous extent and the beginning of the next extent both belong to the same chunk. Thanks to WIDTH_BUCKET and the running totals I just calculated, I can filter out those extents right now.
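The same arithmetic lets us test the filter outside the database. This Python sketch reimplements the bucket calculation inline (hypothetical helper name; the extent numbers are copied from the example table):

```python
def chunk_of(n, total=256, chunks=12):
    # same closed-open bucket arithmetic as WIDTH_BUCKET(n, 0, total, chunks)
    if n < 0:
        return 0
    if n >= total:
        return chunks + 1
    return n * chunks // total + 1

# (first_extent_block, next_extent_block) for the 17 extents of table T
extents = [(i * 8, i * 8 + 8) for i in range(16)] + [(128, 256)]

# keep an extent only when the block just before it and the block just
# after it land in different chunks
kept = [(chunk_of(first - 1), chunk_of(nxt), first, nxt)
        for first, nxt in extents
        if chunk_of(first - 1) < chunk_of(nxt)]
print(len(kept))      # 9 extents survive
print(kept[0])        # (0, 1, 0, 8)
print(kept[-1])       # (6, 13, 128, 256)
```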

with extents_data as (...)
, extents_with_sums as (...) 
, filtered_extents as (  
  select /*+ qb_name(filtered_extents)*/ * from (
    select
      width_bucket(first_extent_block-1, 0, total_blocks, 12)
        as prev_chunk, 
      width_bucket(next_extent_block, 0, total_blocks, 12)
        as next_chunk, 
      e.* 
    from extents_with_sums e   
  )
  where prev_chunk < next_chunk
)
select prev_chunk, next_chunk,
  first_extent_block, next_extent_block, blocks
from filtered_extents;
PREV_CHUNK NEXT_CHUNK FIRST_EXTENT_BLOCK NEXT_EXTENT_BLOCK BLOCKS
---------- ---------- ------------------ ----------------- ------
         0          1                  0                 8      8
         1          2                 16                24      8
         2          3                 40                48      8
         3          4                 56                64      8
         3          4                 64                72      8
         4          5                 80                88      8
         5          6                104               112      8
         6          7                120               128      8
         6         13                128               256    128

If you compare these results with the FIRST_BLOCK and LAST_BLOCK numbers we calculated previously, you can see that the chunk boundaries are all contained within the extents we retained.

Are we done yet?

I wish I could say yes, but I can’t. We have gotten pretty far though. We have the total number of blocks, we have numbered all the blocks consecutively in the right order and we have retained only those extents that contain chunk boundaries.

In the next post I’m going to show how to calculate the chunk boundaries. Afterwards, I’ll demonstrate two methods for mapping those boundaries to extents in order to get the ROWIDs we want.
