Predict_Clustering_Factor

Jonathan Lewis recently wrote about estimating the clustering factor of an index, taking into account the intended value of the TABLE_CACHED_BLOCKS parameter of DBMS_STATS.SET_TABLE_PREFS. He included a function I wrote called predict_clustering_factor. Here is a corrected and improved version. Any questions or remarks about the code are welcome here.

Please refer to blog posts by Jonathan and Richard Foote for explanations of the clustering factor and the TABLE_CACHED_BLOCKS parameter.

UPDATE 2019-10-28: thanks to a comment by Viacheslav Andzhich on Jonathan’s post, I found that PLSQL_CODE_TYPE=’NATIVE’ makes the function run in two-thirds the time. He also mentioned the possibility of overflow on PLS_INTEGER, so I now use INTEGER only. Thanks Viacheslav!

Description

The function takes as input:
p_table_name: the table name
p_column_list: a comma-separated list of the column names to be indexed
p_max_table_cached_blocks: the MAXIMUM value of TABLE_CACHED_BLOCKS to be tested
(255 by default).

The function then queries the ROWIDs, sorted in index order:
– Rows are omitted if all index columns are NULL.
– In case of ties, rows are sorted by ROWID.
– Each ROWID has its “row number” portion set to 0, so it effectively becomes a block id.

The function collects the most recent p_max_table_cached_blocks block ids.
When there is no more room, it replaces the Least Recently Used block id with the current one.

lt_hits_per_RU is a table of intermediate results. The number of entries =  p_max_table_cached_blocks. When a new row has a block id among the most recent, I call that a “hit”. Every hit adds one to an entry in the result table: the index of the entry is based on how “recently used” the block id was. For example, if the current row and the previous row are in the same block, the index is one.

After all the rows are processed, the result table is transformed into a table of clustering factors: the clustering factor is the total number of rows minus all the hits up to and including the current entry.

In the output, the number of rows is equal to p_max_table_cached_blocks.
The ROWNUM provides the TABLE_CACHED_BLOCKS parameter value.

Sample of usage:

select rownum table_cached_blocks,
column_value clustering_factor
from table(predict_clustering_factor('T1','v1,rand'))

The code

ALTER SESSION SET PLSQL_CODE_TYPE='NATIVE';

create or replace function predict_clustering_factor(
/*
Function to predict the clustering factor of an index,
taking into account the intended value of
the TABLE_CACHED_BLOCKS parameter of DBMS_STATS.SET_TABLE_PREFS.

The function takes as input:
- p_table_name: the table name
- p_column_list: a comma-separated list of the column names to be indexed
- p_max_table_cached_blocks: the MAXIMUM value of TABLE_CACHED_BLOCKS to be tested
  (255 by default).

The function then queries the ROWIDs, sorted in index order:
- Rows are omitted if all index columns are NULL.
- In case of ties, rows are sorted by ROWID.
- Each ROWID has its "row number" portion set to 0, so it effectively becomes a block id.

The function collects the most recent p_max_table_cached_blocks block ids.
When there is no more room, it replaces the Least Recently Used block id with the current one.

The function returns a table of clustering factors.
The number of rows is equal to p_max_table_cached_blocks.
The ROWNUM of the table is the TABLE_CACHED_BLOCKS parameter value.

Sample of usage:
  select rownum table_cached_blocks,
    column_value clustering_factor
  from table(predict_clustering_factor('T1','v1,rand'))
*/
  p_table_name in varchar2,
  p_column_list in varchar2,
  p_max_table_cached_blocks in number default 255
) return sys.odcinumberlist authid current_user is

  sql_text varchar2(4000);
  rc sys_refcursor;
  type tt_rids is table of rowid;
  lt_rids tt_rids;

  type t_block is record(
    block_id rowid,
    most_recent_hit integer
  );
  type tt_blocks is table of t_block;
  lt_blocks tt_blocks := new tt_blocks();
  l_block_id rowid;
  l_block_id_prev rowid;

  l_rn integer := 0;
  b_block_found boolean;
  l_LRU integer;
  i_LRU integer := 0;

  lt_hits_per_RU sys.odcinumberlist := new sys.odcinumberlist();
  i_hits_per_RU integer;

  function truncated_rid(p_rid in rowid) return rowid is
    l_rowid_type number;
    l_object_number NUMBER;
    l_relative_fno NUMBER;
    l_block_number NUMBER;
    l_row_number NUMBER;
  begin
    DBMS_ROWID.ROWID_INFO (
      p_rid,
      l_rowid_type,
      l_object_number,
      l_relative_fno,
      l_block_number,
      l_row_number
    );
    return DBMS_ROWID.ROWID_CREATE (
      l_rowid_type,
      l_object_number,
      l_relative_fno,
      l_block_number,
      0
    );
  end truncated_rid;

  function hits_per_RU(p_most_recent_hit in integer) return integer is
    i_hits_per_RU integer := 1;
  begin
    for i in 1..lt_blocks.count loop
      if lt_blocks(i).most_recent_hit > p_most_recent_hit then
        i_hits_per_RU := i_hits_per_RU + 1;
      end if;
    end loop;
    return i_hits_per_RU;
  end hits_per_RU;

begin
  -- Check for valid TABLE_CACHED_PARAMETER value
  if p_max_table_cached_blocks != trunc(p_max_table_cached_blocks)
  or p_max_table_cached_blocks not between 1 and 255 then
    raise_application_error(
      -20001,
      'input parameter p_max_table_cached_blocks must be an integer between 1 and 255'
    );
  end if;

  -- Initialize hits_per_RU table
  lt_hits_per_RU.extend(p_max_table_cached_blocks);
  for i in 1..lt_hits_per_RU.count loop
    lt_hits_per_RU(i) := 0;
  end loop;

  -- Execute query that mimics index
  sql_text :=
    'select rowid from '||p_table_name
    ||' where '||replace(p_column_list, ',', ' is not null or ')||' is not null'
    ||' order by '||p_column_list||', rowid';
  dbms_output.put_line('Query text: '||sql_text);
  open rc for sql_text;

  loop
    fetch rc bulk collect into lt_rids limit 10000;

    for irid in 1..lt_rids.count loop

      l_rn := l_rn + 1;
      l_block_id := truncated_rid(lt_rids(irid));

      -- Optimized treatment of first row
      if l_rn = 1 then
        lt_blocks.extend;
        lt_blocks(1).block_id := l_block_id;
        lt_blocks(1).most_recent_hit := l_rn;
        l_block_id_prev := l_block_id;
        continue;
      end if;

      -- Optimized treatment of consecutive rows in same block
      if l_block_id = l_block_id_prev then
        lt_hits_per_RU(1) := lt_hits_per_RU(1) + 1;
        continue;
      end if;

      l_block_id_prev := l_block_id;
      l_LRU := l_rn;
      b_block_found := false;

      for i in 1..lt_blocks.count loop

        -- if the new block_id is never found,
        -- i_LRU will point to the Least Recently Used block
        if l_LRU > lt_blocks(i).most_recent_hit then
          l_LRU := lt_blocks(i).most_recent_hit;
          i_LRU := i;
        end if;

        -- if the new block_id is found,
        -- then how many blocks ago was it found?
        if lt_blocks(i).block_id = l_block_id then
          b_block_found := true;
          -- how recently used is the block?
          i_hits_per_RU := hits_per_RU(lt_blocks(i).most_recent_hit);
          -- update hit summary
          lt_hits_per_RU(i_hits_per_RU) := lt_hits_per_RU(i_hits_per_RU) + 1;
          -- the block_id was just hit, so update most_recent_hit value
          lt_blocks(i).most_recent_hit := l_rn;
          exit;
        end if;

      end loop;

      -- If new block_id, add to lt_blocks if there is room,
      -- otherwise overwrite Least Recently Used entry
      if not b_block_found then
        if lt_blocks.count <span 				data-mce-type="bookmark" 				id="mce_SELREST_start" 				data-mce-style="overflow:hidden;line-height:0" 				style="overflow:hidden;line-height:0" 			></span>< p_max_table_cached_blocks then
          lt_blocks.extend;
          lt_blocks(lt_blocks.count).block_id := l_block_id;
          lt_blocks(lt_blocks.count).most_recent_hit := l_rn;
        else
          lt_blocks(i_LRU).block_id := l_block_id;
          lt_blocks(i_LRU).most_recent_hit := l_rn;
        end if;
      end if;

    end loop;

    exit when rc%notfound;
  end loop;

  close rc;

-- Prepare output
  -- calculate cumulative sum of hits
  for i in 2..lt_hits_per_RU.count loop
    lt_hits_per_RU(i) := lt_hits_per_RU(i) + lt_hits_per_RU(i-1);
  end loop;
  -- subtract cumulative hits from total number of rows to get
  -- clustering factor. ROWNUM provides the TABLE_CACHED_BLOCKS value.
  for i in 1..lt_hits_per_RU.count loop
    lt_hits_per_RU(i) := l_rn - lt_hits_per_RU(i);
  end loop;

  dbms_output.put_line('Total number of rows in index = '|| l_rn);
  return lt_hits_per_RU;

exception when others then
  if rc%isopen then
    close rc;
  end if;
  raise;

end predict_clustering_factor;
/

5 thoughts on “Predict_Clustering_Factor

  1. Hello Stew,

    As I am following (always !) your blog, I just received the e-mail below, but I don’t understand what is the meaning of the post being password protected.

    Thanks a lot in advance if you could enlighten my understanding.

    And, kudos for the nice PL/SQL procedure you sent to Jonathan Lewis J

    Cheers & Best Regards, Iudith Mentzel

    • Hi Iudith,

      I published the post by accident as I thought of asking Jonathan to look at it first. The password protection was in order to share it with him only. Since the emails went out so quickly, I have removed the password and all should be normal now.

      Best regards, Stew

  2. Hello Stew,

    A magnificent post, as always from you :) :)

    By the way, coming back to the match_recognize solution attempt ( using the “wrong interpretation” ),
    I just mentioned on Jonathan’s blog that maybe you could take the challenge to prepare a (super-advanced) match_recognize presentation for a deeper analysis of the performance implications of using “high quantifier values” in match_recognize patterns.

    That would become for sure a super-star presentation for any upcoming event ( our ILOUG events included, of course :) )

    Thanks a lot once again for all your excellent posts :)

    Best Regards,
    Iudith

  3. Hello Stew,

    Thanks a lot for the reminders :)
    I will definitely review your OOW presentation, which I somehow missed.

    I think that the issue of backtracking deserves a little more “digging” into …
    As by your presentation, during backtracking Oracle acts by “giving rows back” one by one,
    which I think it is the same as it happens in regular expressions processing.

    In Jonathan’s last post, however, there is a slight difference in the explanation of why backtracking consumes a lot of CPU,
    namely, because when it does not find a match, it goes back as far as the start row of the current match for trying to find a match
    with a lower value for the quantifier.

    We, mere mortals, can never know enough about this topic, so any further post from you would be of course always welcome :)

    Cheers & Best Regards,
    Iudith

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s