Chunking tables 6: JOIN solution

At last, here is the first complete solution to the problem of dividing a table into 12 chunks with equal numbers of blocks. James Su and I proposed this solution on OTN, but without the “filtered_extents” optimisation.

The OTN solution is the basis for what Bryn Llewellyn calls “Approx_Method_Sql_Ashton_1” in his white paper.


Preparation

drop table t purge;

create table t as
select rpad('x', 4000, 'x') x
from dual
connect by level <= 130;

var owner varchar2(128);
var table_name varchar2(128);
var num_chunks number;
begin
  :owner := user;
  :table_name := 'T';
  :num_chunks := 12;
end;
/

Up to now I have been hard-coding the owner name, the table name and the number of chunks (12). I hoped this would make the explanation a bit more concrete and therefore easier to follow. Now it’s time to remember that such values should be held in variables, both as a good programming practice and to allow Oracle to reuse the SQL statement without having to parse it every time. Here I am using “host variables”, because it is easier to test the query bit by bit when it is not wrapped in PL/SQL. Ultimately this SQL should live inside a PL/SQL function with formal parameters.

Overview

  1. extents_data: get all extents and data object ids for the table
  2. extents_with_sums:
    1. treat all the table blocks as if they were one contiguous set
    2. number all the blocks in ascending sequence
    3. calculate the total number of blocks
  3. filtered_extents: filter out the extents that do not contain a chunk boundary
  4. chunk_boundaries: calculate the chunk boundaries
  5. rowids:
    1. use a JOIN to map chunk boundaries to extents
    2. calculate each ROWID
    3. group by chunk

extents_data

with extents_data as (
  select /*+ qb_name(extents_data) */
    o.data_object_id, e.file_id, e.block_id, e.blocks
  from dba_extents e
  join all_objects o
  on e.owner = o.owner
    and e.segment_name = o.object_name
    and e.segment_type = o.object_type
    and decode(e.partition_name, o.subobject_name, 0, 1) = 0
  where e.segment_type like 'TABLE%'
    and e.owner = :owner
    and e.segment_name = :table_name
)

I wrote a series of named subqueries using the WITH clause, and gave each a “query block name” (qb_name) equal to the subquery name in order to better understand the execution plan. It turns out that the optimizer rewrites the query in such a way that some of the qb_names disappear.

This first clause provides the raw material for calculating the ROWIDs we need: the data object ids, the file ids, the block ids and the number of blocks per extent.

extents_with_sums

, extents_with_sums as (
  select /*+ qb_name(extents_with_sums) */
    sum(blocks) over() as total_blocks,
    sum(blocks) over(
      order by data_object_id, file_id, block_id
    ) - blocks as first_extent_block,
    sum(blocks) over(
      order by data_object_id, file_id, block_id
    ) - 1 as last_extent_block,
    e.*
  from extents_data e
) 

This subquery gets the total number of blocks (available in each row of the output), plus two running totals:

  • The number of blocks up to but not including the current extent.
    Since I am numbering the blocks consecutively starting from zero, this is the number assigned to the first block of the current extent.
  • The number of blocks up to and including the current extent, minus one.
    This is the number assigned to the last block of the current extent.
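The numbering scheme can be sketched in a few lines of Python, using a made-up list of extent sizes in place of DBA_EXTENTS:

```python
# Hypothetical extent sizes (blocks per extent), already ordered by
# (data_object_id, file_id, block_id) as in the extents_data subquery.
extents = [8, 8, 16, 32]

total_blocks = sum(extents)      # sum(blocks) over()
running = 0
numbered = []
for blocks in extents:
    first_extent_block = running               # blocks before this extent
    last_extent_block = running + blocks - 1   # last block of this extent
    numbered.append((first_extent_block, last_extent_block))
    running += blocks

print(total_blocks)  # 64
print(numbered)      # [(0, 7), (8, 15), (16, 31), (32, 63)]
```

The block numbers start at zero and run consecutively across extents, which is what makes the chunk arithmetic possible.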

filtered_extents

, filtered_extents as (
  select /*+ qb_name(filtered_extents) */ * from (
    select
      width_bucket(first_extent_block-1, 0, total_blocks, :num_chunks)
        as prev_chunk,
      width_bucket(last_extent_block+1, 0, total_blocks, :num_chunks)
        as next_chunk,
      e.*
    from extents_with_sums e
  )
  where prev_chunk < next_chunk
)

Here I eliminate extents that do not contain any chunk boundaries.

  • WIDTH_BUCKET() tells me to which chunk a block belongs.
  • PREV_CHUNK is the chunk to which the end of the previous extent belongs.
  • NEXT_CHUNK is the chunk to which the start of the next extent belongs.
  • If they are the same, then the current extent is in the middle of a chunk and I can safely eliminate it.

For details on how I use WIDTH_BUCKET() here, and CEIL() in the next section, click here: https://stewashton.wordpress.com/2016/02/04/chunking-tables-4-histograms/
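The filter can be simulated in Python. WIDTH_BUCKET() is modelled by hand here (a simplified version that assumes the low bound is less than the high bound); the extent layout and chunk count are made up:

```python
def width_bucket(expr, low, high, nb):
    # Simplified model of Oracle's WIDTH_BUCKET for low < high:
    # values below low go to bucket 0, values >= high to bucket nb + 1.
    if expr < low:
        return 0
    if expr >= high:
        return nb + 1
    return (expr - low) * nb // (high - low) + 1

total_blocks, num_chunks = 64, 2
# (first_extent_block, last_extent_block) for four hypothetical extents
extents = [(0, 7), (8, 15), (16, 31), (32, 63)]

kept = [(first, last)
        for first, last in extents
        if width_bucket(first - 1, 0, total_blocks, num_chunks)
         < width_bucket(last + 1, 0, total_blocks, num_chunks)]
print(kept)  # [(0, 7), (16, 31), (32, 63)]
```

With two chunks of 32 blocks each, the extent (8, 15) sits entirely inside chunk 1, so it is filtered out.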

chunk_boundaries

, chunk_boundaries as (
  select  /*+qb_name(chunk_boundaries) */
    level chunk,
    CEIL( (level-1) * total_blocks / :num_chunks - 1 ) + 1 first_chunk_block,
    CEIL(  level    * total_blocks / :num_chunks - 1 )     last_chunk_block
  from (select total_blocks from extents_with_sums where rownum = 1)
  connect by level <= least(:num_chunks, total_blocks)
)

Having the extents and the extent boundaries, I now need the chunk boundaries.

  • I get TOTAL_BLOCKS by querying the first row from the EXTENTS_WITH_SUMS subquery.
  • Each chunk has a range equal to TOTAL_BLOCKS / :NUM_CHUNKS
  • The upper boundary of each chunk is LEVEL times that range
  • I need the last block that is less than the upper boundary, so I subtract 1 and then use CEIL().
  • least(:num_chunks, total_blocks) ensures that the number of chunks cannot be more than the total number of blocks. This is an alternative to raising an exception.
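Here is the boundary arithmetic in Python, with made-up figures that do not divide evenly (64 blocks into 3 chunks):

```python
import math

total_blocks, num_chunks = 64, 3

boundaries = []
for level in range(1, min(num_chunks, total_blocks) + 1):
    first_chunk_block = math.ceil((level - 1) * total_blocks / num_chunks - 1) + 1
    last_chunk_block  = math.ceil( level      * total_blocks / num_chunks - 1)
    boundaries.append((level, first_chunk_block, last_chunk_block))

print(boundaries)  # [(1, 0, 21), (2, 22, 42), (3, 43, 63)]
```

The chunks are contiguous, cover every block exactly once, and differ in size by at most one block (22, 21 and 21 here).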

rowids

, rowids as (
  select  /*+qb_name(rowids) */ chunk,
    case when b.first_chunk_block
      between e.first_extent_block and e.last_extent_block then
        dbms_rowid.rowid_create(
          1, data_object_id, file_id,
          block_id + first_chunk_block - first_extent_block, 0
        )
    end as start_rowid,
    case when b.last_chunk_block
      between e.first_extent_block and e.last_extent_block then
        dbms_rowid.rowid_create(
          1, data_object_id, file_id,
          block_id + last_chunk_block - first_extent_block, 32767
        )
    end as end_rowid
  from chunk_boundaries b
  join filtered_extents e
  on b.first_chunk_block between e.first_extent_block and e.last_extent_block
   or b.last_chunk_block between e.first_extent_block and e.last_extent_block
)
select chunk, max(start_rowid) start_rowid, max(end_rowid) end_rowid
from rowids
group by chunk;

This is the heart of the solution. By joining the extents to the chunk boundaries, I automatically expand any extent that spans more than one chunk. If I ask for 12 chunks, the JOIN will return between 12 and 24 rows. I calculate START_ROWID only when the extent contains the first block of the chunk, otherwise it is null; the same goes for END_ROWID and the last block.

Finally, I condense the result into 12 rows using GROUP BY.
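The join-and-group logic can be simulated in Python with small hypothetical figures (64 blocks, 2 chunks, three filtered extents), using (block, row) pairs to stand in for real ROWIDs:

```python
# Hypothetical filtered extents and chunk boundaries; a "rowid" is mimicked
# by a (block_number, row_number) pair instead of dbms_rowid.rowid_create.
extents = [(0, 7), (16, 31), (32, 63)]   # (first_extent_block, last_extent_block)
chunks  = [(1, 0, 31), (2, 32, 63)]      # (chunk, first_chunk_block, last_chunk_block)

rows = []
for chunk, first_cb, last_cb in chunks:
    for first_eb, last_eb in extents:
        # join condition: the extent contains at least one chunk boundary
        if first_eb <= first_cb <= last_eb or first_eb <= last_cb <= last_eb:
            start = (first_cb, 0)     if first_eb <= first_cb <= last_eb else None
            end   = (last_cb, 32767)  if first_eb <= last_cb  <= last_eb else None
            rows.append((chunk, start, end))

# group by chunk, keeping the non-null value, like MAX() over NULLs in SQL
result = {}
for chunk, start, end in rows:
    cur = result.setdefault(chunk, [None, None])
    if start is not None: cur[0] = start
    if end is not None:   cur[1] = end

print(rows)    # chunk 1 produced two rows, chunk 2 only one
print(result)  # {1: [(0, 0), (31, 32767)], 2: [(32, 0), (63, 32767)]}
```

Chunk 1 spans two extents, so the join returns two rows for it, each carrying only the boundary its extent contains; the GROUP BY then condenses each chunk to a single row.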

Evolution

My version of this solution has changed over time.

  • On OTN and in Bryn Llewellyn’s white paper, I did not filter the extents.
  • I now make sure my numbering scheme always starts with zero.
  • I have also incorporated WIDTH_BUCKET() and changed the way I calculate the chunk boundaries so that the two calculations are consistent.

Next, another solution, this time without a join.

By the way, here is the code all together if anyone wants to copy and paste.

with extents_data as (
  select /*+ qb_name(extents_data) */
    o.data_object_id, e.file_id, e.block_id, e.blocks
  from dba_extents e
  join all_objects o
  on e.owner = o.owner
    and e.segment_name = o.object_name
    and e.segment_type = o.object_type
    and decode(e.partition_name, o.subobject_name, 0, 1) = 0
  where e.segment_type like 'TABLE%'
    and e.owner = :owner
    and e.segment_name = :table_name
)
, extents_with_sums as (
  select /*+ qb_name(extents_with_sums) */
    sum(blocks) over() as total_blocks,
    sum(blocks) over(
      order by data_object_id, file_id, block_id
    ) - blocks as first_extent_block,
    sum(blocks) over(
      order by data_object_id, file_id, block_id
    ) - 1 as last_extent_block,
    e.*
  from extents_data e
)
, filtered_extents as (
  select /*+ qb_name(filtered_extents) */ * from (
    select
      width_bucket(first_extent_block-1, 0, total_blocks, :num_chunks)
        as prev_chunk,
      width_bucket(last_extent_block+1, 0, total_blocks, :num_chunks)
        as next_chunk,
      e.*
    from extents_with_sums e
  )
  where prev_chunk < next_chunk
)
, chunk_boundaries as (
  select  /*+qb_name(chunk_boundaries) */
    level chunk,
    CEIL( (level-1) * total_blocks / :num_chunks - 1 ) + 1 first_chunk_block,
    CEIL(  level    * total_blocks / :num_chunks - 1 )     last_chunk_block
  from (select total_blocks from extents_with_sums where rownum = 1)
  connect by level <= least(:num_chunks, total_blocks)
)
, rowids as (
  select  /*+qb_name(rowids) */ chunk,
    case when b.first_chunk_block
      between e.first_extent_block and e.last_extent_block then
        dbms_rowid.rowid_create(
          1, data_object_id, file_id,
          block_id + first_chunk_block - first_extent_block, 0
        )
    end as start_rowid,
    case when b.last_chunk_block
      between e.first_extent_block and e.last_extent_block then
        dbms_rowid.rowid_create(
          1, data_object_id, file_id,
          block_id + last_chunk_block - first_extent_block, 32767
        )
    end as end_rowid
  from chunk_boundaries b
  join filtered_extents e
  on b.first_chunk_block between e.first_extent_block and e.last_extent_block
   or b.last_chunk_block between e.first_extent_block and e.last_extent_block
)
select chunk, max(start_rowid) start_rowid, max(end_rowid) end_rowid
from rowids
group by chunk;