Spreadsheet-like Totals and Subtotals

We very often make spreadsheets with subtotals for each row and for each column. Someone on the OTN forum asked how to produce data in this format. I answered using the cool CUBE function.

The question used SCOTT.EMP as input. The requirement was to sum salaries by DEPTNO and JOB and to display them as follows:

JOB        10    20     30    TOTAL
ANALYST          6000         6000
CLERK      1300  1900   950   4150
MANAGER    2450  2975   2850  8275
PRESIDENT  5000               5000
SALESMAN                5600  5600
Total      8750  10875  9400  29025

  • The inner cells contain subtotals by DEPTNO and JOB;
  • the rightmost column contains subtotals by JOB;
  • the bottom row contains subtotals by DEPTNO;
  • and the bottom right cell contains the grand total.

Getting all the totals

The CUBE extension to GROUP BY is ideal for this kind of cross-tabular report: it will generate everything we need with one SELECT and one table scan.

select deptno, job, sum(sal) sal
from scott.emp
group by cube(deptno, job);
DEPTNO  JOB        SAL
                   29025
        CLERK      4150
        ANALYST    6000
        MANAGER    8275
        SALESMAN   5600
        PRESIDENT  5000
10                 8750
10      CLERK      1300
10      MANAGER    2450
10      PRESIDENT  5000
20                 10875
20      CLERK      1900
20      ANALYST    6000
20      MANAGER    2975
30                 9400
30      CLERK      950
30      MANAGER    2850
30      SALESMAN   5600

Formatting the output

Some folks whose opinion I respect say that formatting reports should be done outside of SQL. I agree in principle, but that didn’t stop me from answering the question using the PIVOT clause. As always with this clause, you have to know in advance how many columns you want to end up with!

The tricky part of this particular pivoting operation is handling NULLs correctly. For one thing, the JOB subtotals need to be pivoted to the rightmost column, but they have no DEPTNO value to pivot to. For another thing, the input might have NULLs in the JOB or DEPTNO columns, so I need a reliable way to identify the output rows that have subtotals.

I use the GROUPING() function to identify the subtotals:

  • When GROUPING(DEPTNO) is equal to 1, the row contains a JOB subtotal (or the grand total) and I have to assign an arbitrary DEPTNO value so I can pivot.
  • When GROUPING(JOB) is equal to 1, the row contains a DEPTNO subtotal (or the grand total) so after pivoting I output ‘Total’ in the JOB column of the last row.
select case gr_job when 1 then 'Total' else job end job,
  "10", "20", "30", "Total"
from (
  select case grouping(deptno) when 1 then -1 else deptno end deptno,
    job, grouping(job) gr_job, sum(sal) sal
  from scott.emp
  group by cube(deptno, job)
) 
pivot(  
  sum(sal) for deptno in (10, 20, 30, -1 as "Total")  
)  
order by gr_job, job;
JOB        10    20     30    Total
ANALYST          6000         6000
CLERK      1300  1900   950   4150
MANAGER    2450  2975   2850  8275
PRESIDENT  5000               5000
SALESMAN                5600  5600
Total      8750  10875  9400  29025

New, Improved IN Lists!

In SQL, the “varying IN list” problem comes up constantly and there are many published solutions. My new favorite was inspired by the video Synthesizing rows inside Oracle by Connor McDonald (start at the 6-minute mark).

The idea is to extract a series of delimited values from a string, and to present those values one row at a time. For example, the string
A,BB,CCC,DDDD,EEEEE

would become:
A
BB
CCC
DDDD
EEEEE

I’m not going to bother you with the other solutions, except for two remarks:

  1. Tom Kyte’s classic solution with SUBSTR() and INSTR() is a bit complicated and doesn’t work with strings longer than 3998 bytes.
  2. The popular REGEXP_SUBSTR() solution is a bit complicated and the REGEXP* functions use rather a lot of CPU.

Connor’s idea is efficient, works with the longest strings and seems slightly less complicated to me. However, that’s not going to stop me from explaining my variant in excruciating detail!

Generating rows

Like Connor, I’ll start with a reminder of how easy and efficient it is to generate rows from DUAL using CONNECT BY:

select level from dual connect by level <= 5;
LEVEL
1
2
3
4
5

Extracting a value

So I have 5 rows, big deal… Now I need to put the right value in each row, which SUBSTR() can do. To extract the third value for example:

var txt varchar2(20);
exec :txt := 'A,BB,CCC,DDDD,EEEEE';

select :txt "Input",
  6 "Position",
  3 "Output length",
  substr(:txt, 6, 3) CCC
from dual;
Input                Position  Output length  CCC
A,BB,CCC,DDDD,EEEEE  6         3              CCC

Locating the delimiters

As you probably know, or can guess, the key to getting the right positions is to locate the delimiters. This is a job for INSTR():

select :txt "Input",
  ',' "Substring",
  1 "Position",
  2 "Occurrence",
  instr( :txt, ',', 1, 2 ) "Pos after BB"
from dual;
Input                Substring  Position  Occurrence  Pos after BB
A,BB,CCC,DDDD,EEEEE  ,          1         2           5

Now let’s locate all the commas:

select instr(:txt, ',', 1, level) pos
from dual
connect by level <= 5;
POS
2
5
9
14
0

Do you see how clever Connor is there? There are only 4 commas, but 5 rows are needed. That last row contains a zero because INSTR() did not find a fifth comma. I can pretend that zero is a “virtual comma” before the first value.

[Update 2016-06-23: changed the CONNECT BY to avoid repeated unnecessary calls to INSTR().]

Of course, I mustn’t hard-code that “5” at the end of the statement. Instead,

  • I’ll calculate the length of the input;
  • I’ll calculate the length of the input without commas;
  • by subtraction I get the number of commas,
  • then I add 1 to get the number of output rows.
select instr(:txt, ',', 1, level) pos
from dual
connect by
  level <= length(:txt) - nvl(length(replace(:txt, ',', '')), 0) + 1;
POS
2
5
9
14
0

Putting it all together

Now that I have all my commas, with one “virtual comma” at position zero, I just add one to the comma positions and I have the starting point of each value. The length is equal to the next position, minus this position, minus 1.

select substr(
  :txt,
  pos + 1,
  lead(pos) over(order by pos) - pos - 1
) subs
from (
  select instr(:txt, ',', 1, level) pos
  from dual
  connect by
    level <= length(:txt) - nvl(length(replace(:txt, ',', '')), 0) + 1
);
SUBS
A
BB
CCC
DDDD

What happened on that last line? The LEAD() function tried to access a row that doesn’t exist, so it returned NULL. Fortunately, we can make LEAD() return another value in that situation:

select substr(
  :txt,
  pos + 1,
  lead(pos, 1, 4000) over(order by pos) - pos - 1
) subs
from (
  select instr(:txt, ',', 1, level) pos
  from dual
  connect by
    level <= length(:txt) - nvl(length(replace(:txt, ',', '')), 0) + 1
);
SUBS
A
BB
CCC
DDDD
EEEEE

Conclusion

This to me is an elegant and efficient solution that demonstrates at least 3 clever things we can do with Oracle SQL:

  1. Use CONNECT BY LEVEL to generate as many rows as we want;
  2. Use the zero, returned by INSTR() when it finds nothing, as a “virtual comma” before the first value;
  3. Use the “default” parameter of LEAD() to get a “virtual comma” after the last value.
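
To close the loop on the post’s title, here is a hedged sketch of the query used as a varying IN list. The target table and column (SCOTT.EMP and ENAME) are just for illustration; the subquery is exactly the one built above.

var txt varchar2(20);
exec :txt := 'KING,FORD,MILLER';

select *
from scott.emp
where ename in (
  select substr(
    :txt,
    pos + 1,
    lead(pos, 1, 4000) over(order by pos) - pos - 1
  )
  from (
    select instr(:txt, ',', 1, level) pos
    from dual
    connect by
      level <= length(:txt) - nvl(length(replace(:txt, ',', '')), 0) + 1
  )
);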

Kim Berg Hansen on “Use Cases of Row Pattern Matching in Oracle 12c”

As I write this, I am listening to Kim Berg Hansen explain the MATCH_RECOGNIZE clause. He was kind enough to give me credit for some of the examples and mention this blog.

In addition to my blog posts on the subject, you may enjoy my presentation on SlideShare. Please download it to see the animations!

If you have questions arising from the presentation, please add a comment here.

Bravo to Kim for his remarkably interactive webinar!

Chunking tables 8: second solution

Here is the second complete solution to the problem of dividing a table into chunks of equal size. This is the solution whose first version was marked “correct” on OTN. It is the basis for what Bryn Llewellyn calls “Approx_Method_Sql_Ashton_2” in his white paper.

Overview

  1. extents_data: get all extents and data object ids for the table
  2. extents_with_sums:
    1. treat all the table blocks as if they were one contiguous set
    2. number all the blocks in ascending sequence
    3. calculate total number of blocks
  3. filtered_extents: filter out the extents that do not contain a chunk boundary
  4. expanded_extents: expand every row that crosses chunk boundaries so that there will be one row per chunk
  5. rowids:
    1. calculate each ROWID
    2. group by chunk.

filtered_extents

with extents_data as (
  select /*+ qb_name(extents_data) */
    o.data_object_id, e.file_id, e.block_id, e.blocks
  from dba_extents e
  join all_objects o
  on e.owner = o.owner
    and e.segment_name = o.object_name
    and e.segment_type = o.object_type
    and decode(e.partition_name, o.subobject_name, 0, 1) = 0
  where e.segment_type like 'TABLE%'
    and e.owner = :owner
    and e.segment_name = :table_name
)
, extents_with_sums as (   
  select /*+ qb_name(extents_with_sums) */
    sum(blocks) over() as total_blocks, 
    sum(blocks) over(
      order by data_object_id, file_id, block_id
    ) - blocks as first_extent_block,   
    sum(blocks) over(
      order by data_object_id, file_id, block_id
    ) as next_extent_block, 
    e.* 
  from extents_data e 
) 
, filtered_extents as (  
  select /*+ qb_name(filtered_extents)*/ * from (
    select
      width_bucket(first_extent_block-1, 0, total_blocks, :num_chunks)
        as prev_chunk, 
      width_bucket(first_extent_block, 0, total_blocks, :num_chunks)
        as first_chunk, 
      width_bucket(next_extent_block-1, 0, total_blocks, :num_chunks)
        as last_chunk,  
      width_bucket(next_extent_block, 0, total_blocks, :num_chunks)
        as next_chunk, 
      e.* 
    from extents_with_sums e   
  )
  where prev_chunk < next_chunk
)

The first two named subqueries are identical to the first solution. The filtered_extents subquery differs only by calculating more columns, since I will need them all later:

  • PREV_CHUNK is the chunk to which the end of the previous extent belongs.
  • FIRST_CHUNK and LAST_CHUNK are the chunks containing the first and last block of the current extent.
  • NEXT_CHUNK is the chunk to which the start of the next extent belongs.

expanded_extents

, expanded_extents as (
  select /*+ qb_name(expanded_extents) */
    first_chunk + level - 1 as chunk,
    prev_chunk, next_chunk, data_object_id, file_id,
    block_id, total_blocks, first_extent_block
  from filtered_extents
  connect by first_extent_block = prior first_extent_block
    and prior sys_guid() is not null
    and first_chunk + level - 1 <= last_chunk
)

This subquery uses the method explained in my previous post to expand each row that crosses chunk boundaries. The only difference among the expanded rows is the chunk number assigned to each.

rowids

, rowids as (
  select /*+ qb_name(rowids) */ chunk,
    case when chunk > prev_chunk then dbms_rowid.rowid_create(
      1, data_object_id, file_id,
      block_id + CEIL( (chunk-1) * total_blocks / :num_chunks - 1 ) + 1 - first_extent_block,
      0
    ) 
    end as start_rowid,
    case when chunk < next_chunk then dbms_rowid.rowid_create(
      1, data_object_id, file_id,
      block_id + CEIL( chunk * total_blocks / :num_chunks - 1 )  - first_extent_block,
      32767
    ) 
    end as end_rowid
  from expanded_extents
)
select chunk, min(start_rowid) start_rowid, max(end_rowid) end_rowid
from rowids
group by chunk;

This subquery calculates the starting and ending rowids for each chunk. It is much more compact than the way I did this previously.

By comparing the current chunk number with prev_chunk and next_chunk, I can tell if I should calculate a starting rowid, an ending rowid or both. To calculate the block number,

  • I start from the block_id of the extent
  • I add the block number of the target chunk boundary
  • then I subtract the number I assigned previously to the first block of the extent.

At this point, each chunk may be present in one row or two. A simple GROUP BY gives me one row per chunk.
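
As an aside, here is a hedged sketch (not part of the original solution) of how one chunk might then be consumed, binding the values from one output row into a rowid range scan:

select count(*)
from t  -- any table chunked this way; T is the test table used later in this series
where rowid between :start_rowid and :end_rowid;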

Evolution

My version of this solution has changed over time, especially while writing this series.

  • I now make sure my numbering scheme always starts with zero.
  • I have also incorporated WIDTH_BUCKET() and changed the way I calculate chunk boundaries to be coherent.
  • Above all, I have simplified the code that says when to calculate the rowids, so that the KEEP (DENSE_RANK...) functions are no longer necessary. I apologize for the earlier complicated code.

By the way, here is the code all together if anyone wants to copy and paste.

with extents_data as (
  select /*+ qb_name(extents_data) */
    o.data_object_id, e.file_id, e.block_id, e.blocks
  from dba_extents e
  join all_objects o
  on e.owner = o.owner
    and e.segment_name = o.object_name
    and e.segment_type = o.object_type
    and decode(e.partition_name, o.subobject_name, 0, 1) = 0
  where e.segment_type like 'TABLE%'
    and e.owner = :owner
    and e.segment_name = :table_name
)
, extents_with_sums as (   
  select /*+ qb_name(extents_with_sums) */
    sum(blocks) over() as total_blocks, 
    sum(blocks) over(
      order by data_object_id, file_id, block_id
    ) - blocks as first_extent_block,   
    sum(blocks) over(
      order by data_object_id, file_id, block_id
    ) as next_extent_block, 
    e.* 
  from extents_data e 
) 
, filtered_extents as (  
  select /*+ qb_name(filtered_extents)*/ * from (
    select
      width_bucket(first_extent_block-1, 0, total_blocks, :num_chunks)
        as prev_chunk, 
      width_bucket(first_extent_block, 0, total_blocks, :num_chunks)
        as first_chunk, 
      width_bucket(next_extent_block-1, 0, total_blocks, :num_chunks)
        as last_chunk,  
      width_bucket(next_extent_block, 0, total_blocks, :num_chunks)
        as next_chunk, 
      e.* 
    from extents_with_sums e   
  )
  where prev_chunk < next_chunk
)
, expanded_extents as (
  select /*+ qb_name(expanded_extents) */
    first_chunk + level - 1 as chunk,
    prev_chunk, next_chunk, data_object_id, file_id,
    block_id, total_blocks, first_extent_block
  from filtered_extents
  connect by first_extent_block = prior first_extent_block
    and prior sys_guid() is not null
    and first_chunk + level - 1 <= last_chunk
)
, rowids as (
  select /*+ qb_name(rowids) */ chunk,
    case when chunk > prev_chunk then dbms_rowid.rowid_create(
      1, data_object_id, file_id,
      block_id + CEIL( (chunk-1) * total_blocks / :num_chunks - 1 ) + 1 - first_extent_block,
      0
    ) 
    end as start_rowid,
    case when chunk < next_chunk then dbms_rowid.rowid_create(
      1, data_object_id, file_id,
      block_id + CEIL( chunk * total_blocks / :num_chunks - 1 )  - first_extent_block,
      32767
    ) 
    end as end_rowid
  from expanded_extents
)
select chunk, min(start_rowid) start_rowid, max(end_rowid) end_rowid
from rowids
group by chunk;

Chunking tables 7: prior sys_guid() ???

The second solution for dividing a table into equal chunks does not do a JOIN. Instead it expands extent rows that contain several chunk boundaries, using an obscure method that I need to explain.

Generating rows

To generate rows quickly, it is common to use the CONNECT BY clause.

select level num
from dual
connect by level <= 5;
NUM
1
2
3
4
5

This is a simple “hierarchical” query that uses the LEVEL pseudocolumn. LEVEL starts at 1 and increments every time the CONNECT BY is executed. The CONNECT BY condition says to keep going as long as LEVEL is less than or equal to 5.

Note that there is no reference to a PRIOR row in the condition, since there is only one row in the DUAL table.

Expanding rows

Here is where things get tricky.

Suppose I have two rows and I want to expand both of them to get one row for every integer in the range. For example, take this table:

drop table U purge;

create table U as
select 1 range_id, 2 range_end from dual
union all
select 2, 3 from dual;

select * from u;
RANGE_ID  RANGE_END
1         2
2         3

Using the very same technique as before:

select range_id, range_end, level
from u
connect by level <= range_end;
RANGE_ID  RANGE_END  LEVEL
1         2          1
1         2          2
2         3          3
2         3          2
2         3          3
2         3          1
1         2          2
2         3          3
2         3          2
2         3          3

What is this mess? It looks like I’m starting with each row and connecting to the other row – which makes sense because I’m not saying to stay on the same row. Let’s try again:

select range_id, range_end, level
from u
connect by level <= range_end
and range_id = prior range_id;

Error report - SQL Error: ORA-01436: CONNECT BY loop in user data

Now I made a reference to something PRIOR – the range_id. Oracle sees that the same range_id is accessed twice in a row, so it assumes there is an infinite loop and aborts the execution.

There is a way to avoid that error, using the NOCYCLE keyword:

select range_id, range_end, level
from u
connect by nocycle level <= range_end
and range_id = prior range_id;
RANGE_ID  RANGE_END  LEVEL
1         2          1
2         3          1

Well, I didn’t get the error, but Oracle still considers that doing the same range_id twice would be a loop, so it stops first.

Enter SYS_GUID()

What we need is to add something to the prior row that will make Oracle think it is different. SYS_GUID() is a very low-cost function that returns a nonrepeating value. If we refer to PRIOR SYS_GUID() in a condition, that is enough to make the prior row unique and to prevent the perception of an infinite loop.

select range_id, range_end, level
from u
connect by level <= range_end
and range_id = prior range_id
and prior sys_guid() is not null;
RANGE_ID  RANGE_END  LEVEL
1         2          1
1         2          2
2         3          1
2         3          2
2         3          3

In the next post, I will use this technique to expand extent rows that contain multiple chunks.

P.S. If you have version 12c, you can use the LATERAL clause, which will expand each row without bending your mind; it returns the same five rows as the SYS_GUID() version.

select * from u,
lateral(
  select level lvl from dual
  connect by level <= range_end
);

Chunking tables 6: JOIN solution

At last, here is the first complete solution to the problem of dividing a table into 12 chunks with equal numbers of blocks. James Su and I proposed this solution on OTN, but without the “filtered_extents” optimisation.

The OTN solution is the basis for what Bryn Llewellyn calls “Approx_Method_Sql_Ashton_1” in his white paper.

Preparation

drop table t purge;

create table t as
select rpad('x', 4000, 'x') x
from dual
connect by level <= 130;

var owner varchar2(128);
var table_name varchar2(128);
var num_chunks number;
begin
  :owner := user;
  :table_name := 'T';
  :num_chunks := 12;
end;
/

Up to now I have been hard-coding the owner name, the table name and the number of chunks (12). I hoped this would make the explanation a bit more concrete and therefore easier to follow. Now it’s time to remember that such values should be held in variables, both as a good programming practice and to allow Oracle to reuse the SQL statement without having to parse it every time. Here I am using “host variables”, because it is easier to test the query bit by bit when it is not wrapped in PL/SQL. Ultimately this SQL should live inside a PL/SQL function with formal parameters.
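
Here is a minimal sketch of that last point, assuming my own function name and a stand-in query: the real body would be the full chunking statement shown at the end of this post, with the host variables replaced by the formal parameters.

create or replace function chunk_rowids(
  p_owner      in varchar2,
  p_table_name in varchar2,
  p_num_chunks in number
) return sys_refcursor
is
  rc sys_refcursor;
begin
  open rc for
    -- stand-in SELECT: the full "with extents_data as (...)" query goes here,
    -- using p_owner, p_table_name and p_num_chunks instead of host variables
    select e.file_id, e.block_id, e.blocks
    from dba_extents e
    where e.owner = p_owner
      and e.segment_name = p_table_name
      and e.segment_type like 'TABLE%';
  return rc;
end chunk_rowids;
/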

Overview

  1. extents_data: get all extents and data object ids for the table
  2. extents_with_sums:
    1. treat all the table blocks as if they were one contiguous set;
    2. number all the blocks in ascending sequence
    3. calculate total number of blocks
  3. filtered_extents: filter out the extents that do not contain a chunk boundary
  4. chunk_boundaries: calculate the chunk boundaries
  5. rowids:
    1. Use JOIN to map chunk boundaries to extents;
    2. calculate each ROWID
    3. group by chunk

extents_data

with extents_data as (
  select /*+ qb_name(extents_data) */
    o.data_object_id, e.file_id, e.block_id, e.blocks
  from dba_extents e
  join all_objects o
  on e.owner = o.owner
    and e.segment_name = o.object_name
    and e.segment_type = o.object_type
    and decode(e.partition_name, o.subobject_name, 0, 1) = 0
  where e.segment_type like 'TABLE%'
    and e.owner = :owner
    and e.segment_name = :table_name
)

I wrote a series of named subqueries using the WITH clause, and gave a “query block name” (qb_name) equal to the subquery name in order to better understand the execution plan. It turns out that the optimizer rewrites the query in such a way that some of the qb_names disappear.

This first clause provides the raw material for calculating the ROWIDs we need: the data object ids, the file ids, the block ids and the number of blocks per extent.

extents_with_sums

, extents_with_sums as (
  select /*+ qb_name(extents_with_sums) */
    sum(blocks) over() as total_blocks,
    sum(blocks) over(
      order by data_object_id, file_id, block_id
    ) - blocks as first_extent_block,
    sum(blocks) over(
      order by data_object_id, file_id, block_id
    ) - 1 as last_extent_block,
    e.*
  from extents_data e
) 

This subquery gets the total number of blocks (available in each row of the output), plus two running totals, illustrated just after this list:

  • The number of blocks up to but not including the current extent.
    Since I am numbering the blocks consecutively starting from zero, this is the number assigned to the first block of the current extent.
  • The number of blocks up to and including the current extent, minus one.
    This is the number assigned to the last block of the current extent.
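
To make the numbering scheme concrete, here is a tiny self-contained illustration with three made-up extents of 8 blocks each (my own numbers, not from the post):

with extents_data as (
  select level as block_id, 8 as blocks from dual connect by level <= 3
)
select
  sum(blocks) over() as total_blocks,
  sum(blocks) over(order by block_id) - blocks as first_extent_block,
  sum(blocks) over(order by block_id) - 1 as last_extent_block,
  blocks
from extents_data;

TOTAL_BLOCKS is 24 on every row, FIRST_EXTENT_BLOCK comes out as 0, 8 and 16, and LAST_EXTENT_BLOCK as 7, 15 and 23, numbering from zero as promised.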

filtered_extents

, filtered_extents as (
  select /*+ qb_name(filtered_extents) */ * from (
    select
      width_bucket(first_extent_block-1, 0, total_blocks, :num_chunks)
        as prev_chunk,
      width_bucket(last_extent_block+1, 0, total_blocks, :num_chunks)
        as next_chunk,
      e.*
    from extents_with_sums e
  )
  where prev_chunk < next_chunk
)

Here I eliminate extents that do not contain any chunk boundaries.

  • WIDTH_BUCKET() tells me to which chunk a block belongs.
  • PREV_CHUNK is the chunk to which the end of the previous extent belongs.
  • NEXT_CHUNK is the chunk to which the start of the next extent belongs.
  • If they are the same, then the current extent is in the middle of a chunk and I can safely eliminate it.

For details on how I use WIDTH_BUCKET() here, and CEIL() in the next section, click here: https://stewashton.wordpress.com/2016/02/04/chunking-tables-4-histograms/

chunk_boundaries

, chunk_boundaries as (
  select  /*+qb_name(chunk_boundaries) */
    level chunk,
    CEIL( (level-1) * total_blocks / :num_chunks - 1 ) + 1 first_chunk_block,
    CEIL(  level    * total_blocks / :num_chunks - 1 )     last_chunk_block
  from (select total_blocks from extents_with_sums where rownum = 1)
  connect by level <= least(:num_chunks, total_blocks)
)

Having the extents and the extent boundaries, I now need the chunk boundaries.

  • I get TOTAL_BLOCKS by querying the first row from the EXTENTS_WITH_SUMS subquery.
  • Each chunk has a range equal to TOTAL_BLOCKS / :NUM_CHUNKS
  • The upper boundary of each chunk is LEVEL times that range
  • I need the last block that is less than the upper boundary, so I subtract 1 and then use CEIL().
  • least(:num_chunks, total_blocks) ensures that the number of chunks cannot be more than the total number of blocks. This is an alternative to raising an exception.

rowids

, rowids as (
  select  /*+qb_name(rowids) */ chunk,
    case when b.first_chunk_block
      between e.first_extent_block and e.last_extent_block then
        dbms_rowid.rowid_create(
          1, data_object_id, file_id,
          block_id + first_chunk_block - first_extent_block, 0
        )
    end as start_rowid,
    case when b.last_chunk_block
      between e.first_extent_block and e.last_extent_block then
        dbms_rowid.rowid_create(
          1, data_object_id, file_id,
          block_id + last_chunk_block - first_extent_block, 32767
        )
    end as end_rowid
  from chunk_boundaries b
  join filtered_extents e
  on b.first_chunk_block between e.first_extent_block and e.last_extent_block
   or b.last_chunk_block between e.first_extent_block and e.last_extent_block
)
select chunk, max(start_rowid) start_rowid, max(end_rowid) end_rowid
from rowids
group by chunk;

This is the heart of the solution. By joining the extents to the chunk boundaries, I automatically expand any extent that spans more than one chunk. If I ask for 12 chunks, the JOIN will return between 12 and 24 rows. I calculate START_ROWID only when appropriate, otherwise it is null – and the same goes for END_ROWID.

Finally, I condense the result into 12 rows using GROUP BY.

Evolution

My version of this solution has changed over time.

  • On OTN and in Bryn Llewellyn’s white paper, I did not filter the extents.
  • I now make sure my numbering scheme always starts with zero.
  • I have also incorporated WIDTH_BUCKET() and changed the way I calculate chunk boundaries to be coherent.

Next, another solution, this time without a join.

By the way, here is the code all together if anyone wants to copy and paste.

with extents_data as (
  select /*+ qb_name(extents_data) */
    o.data_object_id, e.file_id, e.block_id, e.blocks
  from dba_extents e
  join all_objects o
  on e.owner = o.owner
    and e.segment_name = o.object_name
    and e.segment_type = o.object_type
    and decode(e.partition_name, o.subobject_name, 0, 1) = 0
  where e.segment_type like 'TABLE%'
    and e.owner = :owner
    and e.segment_name = :table_name
)
, extents_with_sums as (
  select /*+ qb_name(extents_with_sums) */
    sum(blocks) over() as total_blocks,
    sum(blocks) over(
      order by data_object_id, file_id, block_id
    ) - blocks as first_extent_block,
    sum(blocks) over(
      order by data_object_id, file_id, block_id
    ) - 1 as last_extent_block,
    e.*
  from extents_data e
)
, filtered_extents as (
  select /*+ qb_name(filtered_extents) */ * from (
    select
      width_bucket(first_extent_block-1, 0, total_blocks, :num_chunks)
        as prev_chunk,
      width_bucket(last_extent_block+1, 0, total_blocks, :num_chunks)
        as next_chunk,
      e.*
    from extents_with_sums e
  )
  where prev_chunk < next_chunk
)
, chunk_boundaries as (
  select  /*+qb_name(chunk_boundaries) */
    level chunk,
    CEIL( (level-1) * total_blocks / :num_chunks - 1 ) + 1 first_chunk_block,
    CEIL(  level    * total_blocks / :num_chunks - 1 )     last_chunk_block
  from (select total_blocks from extents_with_sums where rownum = 1)
  connect by level <= least(:num_chunks, total_blocks)
)
, rowids as (
  select  /*+qb_name(rowids) */ chunk,
    case when b.first_chunk_block
      between e.first_extent_block and e.last_extent_block then
        dbms_rowid.rowid_create(
          1, data_object_id, file_id,
          block_id + first_chunk_block - first_extent_block, 0
        )
    end as start_rowid,
    case when b.last_chunk_block
      between e.first_extent_block and e.last_extent_block then
        dbms_rowid.rowid_create(
          1, data_object_id, file_id,
          block_id + last_chunk_block - first_extent_block, 32767
        )
    end as end_rowid
  from chunk_boundaries b
  join filtered_extents e
  on b.first_chunk_block between e.first_extent_block and e.last_extent_block
   or b.last_chunk_block between e.first_extent_block and e.last_extent_block
)
select chunk, max(start_rowid) start_rowid, max(end_rowid) end_rowid
from rowids
group by chunk;

Chunking tables 5: chunk boundaries

We still want to split a table into 12 chunks with equal numbers of blocks, so we can access each chunk using a rowid range. We have the extents we need and now we need to calculate the exact chunk boundaries.

Histograms and WIDTH_BUCKET

I explained in my previous post that WIDTH_BUCKET is based on a histogram having identical intervals in a range.

  • The minimum range value is included in the range but the maximum range value is not.
  • The same thing is true for each interval: the minimum value is included in the interval, but not the maximum value.

We have already used WIDTH_BUCKET to answer the question: “what chunk is this block in?” Now we are asking the opposite question: “what blocks start and end this chunk?” We need to make absolutely sure the answers are coherent!
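
As a quick hedged check of those two rules, using the same range as the test below (256 blocks, 12 chunks):

select width_bucket(0,   0, 256, 12) as first_block,  -- 1: the minimum is included
  width_bucket(255, 0, 256, 12) as last_block,   -- 12: the last block below the maximum
  width_bucket(256, 0, 256, 12) as range_max     -- 13: the maximum falls in the overflow bucket
from dual;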

Getting the upper chunk boundary

Using our test table from the previous post, let’s compare what WIDTH_BUCKET tells us with a wrong answer and a right answer. Then I’ll explain in detail. Right now, I’m only interested in those blocks that end a chunk.

select this_block_chunk, block last_block,
  round(256 * (this_block_chunk / 12), 1) max_chunk_val,
  floor(256 * (this_block_chunk / 12)) floor_val,
  ceil (256 * (this_block_chunk / 12) - 1) right_val
from (
  select level-1 block,
  width_bucket(level-1,0,256,12) this_block_chunk,
  width_bucket(level,0,256,12) next_block_chunk
  from dual
  connect by level <= 256
)
where this_block_chunk < next_block_chunk;
THIS_BLOCK_CHUNK  LAST_BLOCK  MAX_CHUNK_VAL  FLOOR_VAL  RIGHT_VAL
1                 21          21.3           21         21
2                 42          42.7           42         42
3                 63          64             64         63
4                 85          85.3           85         85
5                 106         106.7          106        106
6                 127         128            128        127
7                 149         149.3          149        149
8                 170         170.7          170        170
9                 191         192            192        191
10                213         213.3          213        213
11                234         234.7          234        234
12                255         256            256        255

For each block, I asked WIDTH_BUCKET what chunk it is in and what chunk the next block is in, and I kept only those blocks where the next block is in the next chunk. That means I have a list of the last blocks for each chunk.

In the third column, I calculate the maximum value for each chunk. Notice it is not always an integer value. How can we calculate the greatest block number that is less than that maximum value?

  • Using CEIL() or ROUND(), we will sometimes get a result that is greater than the maximum value.
  • Using FLOOR(), we will sometimes get a result that is equal to the maximum value – but we need to be less than, not equal to! You can see the problem in chunks 3, 6, 9 and 12.
  • The right answer is to subtract 1 from the maximum value and then use CEIL(). That way the result will always be less than the maximum value, but as close to it as possible – see the quick check after this list.
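
Here is that quick check, reusing chunks 1 and 3 from the table above (maximum values 21.3 and 64):

select ceil(256 * (1/12) - 1) as last_block_chunk_1,  -- 21, just below 21.3
  ceil(256 * (3/12) - 1) as last_block_chunk_3        -- 63, strictly below 64
from dual;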

Getting the lower chunk boundary

I’ll bet you thought of this yourself. The lower boundary of a chunk is one greater than the upper boundary of the previous chunk :-)
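
As a hedged check with the same numbers: chunk 3 ends at block 63, so chunk 4 must start at block 64, which is exactly what the FIRST_CHUNK_BLOCK expression in the complete solutions computes.

select ceil(3 * 256/12 - 1) as last_block_chunk_3,     -- 63
  ceil((4-1) * 256/12 - 1) + 1 as first_block_chunk_4  -- 64
from dual;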

Coming next…

We have the extents we need, and we have a method for calculating all the chunk boundaries. In the next post, I’ll present the first of two complete solutions for chunking a table.
