My Presentations at #UKOUG #techfest19

I just finished my second presentation at TechFest19 in sunny Brighton, England. The conference this year is great from every point of view: venue, people, content, conversations… Thanks to those who came to my talks for your benevolent attention and excellent questions/remarks!

Both my presentations are now on SlideShare. Please download them so the animations work; if you view them online instead, you will see a jumble of overlapping elements.

Best wishes to all for the end of the conference and the trip home.

Regards, Stew

Make your Data Dance: UNPIVOT, PIVOT and GROUP BY extensions

JSON in Oracle database 18c and 19c

MATCH_RECOGNIZE Restrictions

Sometimes we think that certain restrictions are not documented when in fact they are. Where do we forget to look? In the database error messages.

As mentioned recently on asktom.oracle.com (thanks, Connor), all error messages and event codes are listed in $ORACLE_HOME/rdbms/mesg/oraus.msg. Searching for “pattern matching”, we read “62500 – 62549 are reserved for PATTERN MATCHING”.

So here is a link to all the MATCH_RECOGNIZE error messages: ORA-60001 to ORA-65535

As of version 19c, the range actually used is 62500 through 62521.

It was a surprise to discover that “bounded quantifiers” are limited to 200, but there it is in black and white:

“ORA-62518: MATCH_RECOGNIZE pattern is too large.”
“Permute with more than 7 elements or bounded quantifier bound greater than 200 are … currently not supported.”

I’m not suggesting we spend our lives reading database error messages, but we have to admit they are documented ;-)

Best regards,
Stew

P.S. Where are the values of UB2MAXVAL and UB4MAXVAL documented? If you know, please post a link in the comments below. I suspect UB2MAXVAL is 0xFFFF and UB4MAXVAL is 0xFFFFFFFF.

FOOTNOTE 2019-11-05 : thanks to Maxim Demenko for pointing out where UB*MAXVAL are documented. In the OCI Programmer’s Guide, at the very end of the chapter on Data Types, it says

“Throughout this guide there are references to data types like ub2 or sb4, or to constants like UB4MAXVAL. These types are defined in the oratypes.h header file, which is found in the public directory. The exact contents may vary according to the operating system that you are using.”

LISTAGG() as a Cumulative Function

LISTAGG() can be used as an analytic function, but cannot be cumulative. What does that mean, and is there a workaround?

I’ll demonstrate with the boring old DEPT and EMP tables, just because they have so few rows, and I’ll keep only the department and employee names:

create or replace view denames as
select dname, ename from dept join emp using(deptno);

select * from denames
order by dname, ename;
DNAME ENAME
ACCOUNTING CLARK
ACCOUNTING KING
ACCOUNTING MILLER
RESEARCH ADAMS
RESEARCH FORD
RESEARCH JONES
RESEARCH SCOTT
RESEARCH SMITH
SALES ALLEN
SALES BLAKE
SALES JAMES
SALES MARTIN
SALES TURNER
SALES WARD

 

Reminder: aggregate versus analytic

“Aggregate functions return a single result row based on groups of rows, rather than on single rows.” (19c doc)

select dname,
  listagg(ename, ',') within group(order by ename) enames
from denames
group by dname
order by dname;
DNAME ENAMES
ACCOUNTING CLARK,KING,MILLER
RESEARCH ADAMS,FORD,JONES,SCOTT,SMITH
SALES ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD

 

By contrast, “analytic functions…differ from aggregate functions in that they return multiple rows for each group…All joins and all WHERE, GROUP BY, and HAVING clauses are completed before the analytic functions are processed.” (19c). In other words, the output rows are already determined and any analytic function simply adds a column to each existing row. The “analytic clause”, starting with the keyword OVER(), says what rows provide data for the function.

select de.*,
  listagg(ename, ',') within group(order by ename)
    over(partition by dname)
  enames
from denames de;
DNAME ENAME ENAMES
ACCOUNTING CLARK CLARK,KING,MILLER
ACCOUNTING KING CLARK,KING,MILLER
ACCOUNTING MILLER CLARK,KING,MILLER
RESEARCH ADAMS ADAMS,FORD,JONES,SCOTT,SMITH
RESEARCH FORD ADAMS,FORD,JONES,SCOTT,SMITH
RESEARCH JONES ADAMS,FORD,JONES,SCOTT,SMITH
RESEARCH SCOTT ADAMS,FORD,JONES,SCOTT,SMITH
RESEARCH SMITH ADAMS,FORD,JONES,SCOTT,SMITH
SALES ALLEN ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD
SALES BLAKE ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD
SALES JAMES ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD
SALES MARTIN ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD
SALES TURNER ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD
SALES WARD ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD

 

LISTAGG() Limitations

The “analytic clause” has three parts:

  1. PARTITION BY: like GROUP BY, separates the data into chunks based on one or more values;
  2. ORDER BY: within the “partition”, orders the rows and sets up the next part;
  3. Windowing clause: says what rows within the ordered partition are concerned by the function.

Some analytic functions allow only the first or second parts, and LISTAGG() is one of them. Let’s try to use it as a cumulative function:

select de.*,
  listagg(ename, ',') within group(order by ename)
    over(partition by dname order by ename)
  enames
from denames de;

SQL Error: ORA-30487: ORDER BY not allowed here
30487. 00000 -  "ORDER BY not allowed here"
*Cause:    DISTINCT functions and RATIO_TO_REPORT cannot have an ORDER BY

The error message is a bit strange, but the conclusion is: it doesn’t work.

The workaround

The idea is simple: take what LISTAGG() gives us and do a SUBSTR() based on the row number:

select dname, ename,
  substr(enames, 1, instr(enames, ',', 1, rn) - 1) enames
from (
  select de.*,
    listagg(ename, ',') within group(order by ename)
      over(partition by dname)
    || ','
    enames,
    row_number() over(partition by dname order by ename) rn
  from denames de
);
DNAME ENAME ENAMES
ACCOUNTING CLARK CLARK
ACCOUNTING KING CLARK,KING
ACCOUNTING MILLER CLARK,KING,MILLER
RESEARCH ADAMS ADAMS
RESEARCH FORD ADAMS,FORD
RESEARCH JONES ADAMS,FORD,JONES
RESEARCH SCOTT ADAMS,FORD,JONES,SCOTT
RESEARCH SMITH ADAMS,FORD,JONES,SCOTT,SMITH
SALES ALLEN ALLEN
SALES BLAKE ALLEN,BLAKE
SALES JAMES ALLEN,BLAKE,JAMES
SALES MARTIN ALLEN,BLAKE,JAMES,MARTIN
SALES TURNER ALLEN,BLAKE,JAMES,MARTIN,TURNER
SALES WARD ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD
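Outside the database, the same SUBSTR/INSTR trick is easy to check; here is a quick Python sketch of the logic (an illustration only, not Oracle code):

```python
def cumulative_lists(names):
    """Mimic the workaround: aggregate once, append a trailing comma,
    then cut each row's list just before its rn-th comma."""
    full = ','.join(names) + ','            # LISTAGG(...) || ','
    result = []
    for rn in range(1, len(names) + 1):     # rn = ROW_NUMBER() within the partition
        pos = -1
        for _ in range(rn):                 # INSTR(enames, ',', 1, rn)
            pos = full.index(',', pos + 1)
        result.append(full[:pos])           # SUBSTR(enames, 1, pos - 1)
    return result
```

For the ACCOUNTING department, `cumulative_lists(['CLARK', 'KING', 'MILLER'])` returns `['CLARK', 'CLARK,KING', 'CLARK,KING,MILLER']`, matching the output above.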

 

Caveats

This solution seems fine, as long as the source values are NOT NULL. If there is a NULL value, the “cumulative value” becomes NULL. Look at the last row of the output.

update emp set ename = null where ename = 'ALLEN';

(repeat the query)
DNAME ENAME ENAMES
ACCOUNTING CLARK CLARK
ACCOUNTING KING CLARK,KING
ACCOUNTING MILLER CLARK,KING,MILLER
RESEARCH ADAMS ADAMS
RESEARCH FORD ADAMS,FORD
RESEARCH JONES ADAMS,FORD,JONES
RESEARCH SCOTT ADAMS,FORD,JONES,SCOTT
RESEARCH SMITH ADAMS,FORD,JONES,SCOTT,SMITH
SALES BLAKE BLAKE
SALES JAMES BLAKE,JAMES
SALES MARTIN BLAKE,JAMES,MARTIN
SALES TURNER BLAKE,JAMES,MARTIN,TURNER
SALES WARD BLAKE,JAMES,MARTIN,TURNER,WARD
SALES

 

Also, as of version 19c, LISTAGG() allows the DISTINCT keyword. If you want to use DISTINCT, then use DENSE_RANK() instead of ROW_NUMBER()!

update emp set ename = 'MARTIN' where ename is null;

select dname, ename,
  substr(enames, 1, instr(enames, ',', 1, rn) - 1) enames
from (
  select de.*,
    listagg(DISTINCT ename, ',') within group(order by ename)
      over(partition by dname)
    || ','
    enames,
    DENSE_RANK() over(partition by dname order by ename) rn
  from denames de
);
DNAME ENAME ENAMES
ACCOUNTING CLARK CLARK
ACCOUNTING KING CLARK,KING
ACCOUNTING MILLER CLARK,KING,MILLER
RESEARCH ADAMS ADAMS
RESEARCH FORD ADAMS,FORD
RESEARCH JONES ADAMS,FORD,JONES
RESEARCH SCOTT ADAMS,FORD,JONES,SCOTT
RESEARCH SMITH ADAMS,FORD,JONES,SCOTT,SMITH
SALES BLAKE BLAKE
SALES JAMES BLAKE,JAMES
SALES MARTIN BLAKE,JAMES,MARTIN
SALES MARTIN BLAKE,JAMES,MARTIN
SALES TURNER BLAKE,JAMES,MARTIN,TURNER
SALES WARD BLAKE,JAMES,MARTIN,TURNER,WARD

 
With these precautions in mind, feel free to use LISTAGG() as a cumulative function if you want. Hope this helps…

Making Longer Lists

For very long lists, we need the return values to be CLOBs. Alas, LISTAGG can only return VARCHAR2 lists, but XML and JSON aggregate functions can return CLOBs!

As an exercise, I’ll start from ALL_OBJECTS: I want to make a list of all fully qualified object names for each object type. To start with, a simple subquery:

with data as (
  select object_type, owner ||'.'|| object_name obj
  from all_objects
)
select * from data;

OBJECT_TYPE  OBJ
INDEX	     SYS.I_FILE#_BLOCK#
INDEX	     SYS.I_OBJ3
INDEX	     SYS.I_TS1
INDEX	     SYS.I_CON1
...

XML

XML functions work on the XMLTYPE data type, which has practically no length limitations. To make my list, I have to go through three steps:

  1. Make an XML fragment out of each object.
    This step will escape any special characters such as '<' or '>'.
  2. Aggregate the XML fragments.
  3. Extract the text from the aggregated XML data.
    This step will unescape the special characters.

I have to handle the comma that separates the elements of the list myself.

with data as (
  select object_type, owner ||'.'|| object_name obj
  from all_objects
)
select object_type,
SUBSTR(
  xmlcast(
    xmlagg(
      xmlelement(O, ',' || obj)
      order by obj
    ) AS CLOB
  ), 2
) qualified_objects
from data
group by object_type
order by object_type;
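The three steps are easy to mimic outside the database. Here is a Python sketch using `xml.sax.saxutils` (the object names are made up; this just illustrates the escape / aggregate / unescape round trip):

```python
from xml.sax.saxutils import escape, unescape

objs = ['SCOTT.EMP', 'A<B', 'C&D']   # made-up names, two with XML special characters

# 1. one escaped fragment per object, with the leading separator inside it
fragments = ['<O>' + escape(',' + o) + '</O>' for o in sorted(objs)]
# 2. aggregate the fragments
aggregated = ''.join(fragments)
# 3. extract the text and unescape it, then drop the leading comma (the SUBSTR(..., 2))
text = unescape(aggregated.replace('<O>', '').replace('</O>', ''))[1:]
# text == 'A<B,C&D,SCOTT.EMP'
```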

On my system with about 78,000 objects, this takes a bit over 3 seconds.

JSON

JSON support in the Oracle database is pretty new. For full support of CLOBs, we need version 18c or later (or a patched version 12.2, but I don’t know the precise version number).

The steps are a little different with JSON:

  1. I can aggregate the data directly into a JSON array.
    This escapes any special characters like double quotes, newlines, etc.
  2. To turn the multiple array items into one (which is my list), I replace "," (which separates the array items) with my separator.
  3. I then use JSON_VALUE to extract the list from the array.
    This unescapes any character that needs it.
with data as (
  select object_type, owner ||'.'|| object_name obj
  from all_objects
)
select object_type,
  json_value(
    replace(
      json_arrayagg(obj order by obj returning clob),
      '","',
      ','
    ),
    '$[0]' returning clob
  ) qualified_objects
from data
group by object_type
order by object_type;
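The escape-merge-unescape round trip can be checked outside the database; here is the same trick sketched with Python's json module:

```python
import json

items = ['plain', 'with,comma', 'with"quote']

# 1. aggregate into a JSON array: special characters get escaped
arr = json.dumps(items, separators=(',', ':'))   # like JSON_ARRAYAGG
# 2. merge all the array items into one by replacing the item separator '","'
merged = arr.replace('","', ',')
# 3. extract the single remaining item, which unescapes whatever needs it
result = json.loads(merged)[0]                   # like JSON_VALUE with '$[0]'
# result == 'plain,with,comma,with"quote'
```

The replacement is safe because a literal `","` inside a value would have its double quotes escaped as `\"`, so it can never match the item separator.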

Instead of more than 3 seconds, this code takes a bit over 1 second, so it is about three times as fast.

As a reminder, JSON is great for going the other way: split the items of a list into a table. See Splitting Strings: a New Champion!

Interval expressions

I just learned about these critters thanks to an ODC forum question: how to calculate the difference between two dates in calendar years and months, days, hours, minutes and seconds.

What are they, and why are they?

Take any two datetime values (DATE, TIMESTAMP, TIMESTAMP WITH (LOCAL) TIME ZONE) and subtract one from the other: what do you get?

  • With two dates, you get the number of days.
  • With two timestamps, you get an INTERVAL DAY TO SECOND.

If you always want the interval, just put the two values in parentheses and add DAY TO SECOND. You can add YEAR TO MONTH instead to get an INTERVAL YEAR TO MONTH.

All the following combinations are OK:

select (sysdate - sysdate) day to second from dual;
select (sysdate - sysdate) year to month from dual;
select (sysdate - systimestamp) day to second from dual;
select (sysdate - systimestamp) year to month from dual;
select (systimestamp - sysdate) day to second from dual;
select (systimestamp - sysdate) year to month from dual;
select (systimestamp - systimestamp) day to second from dual;
select (systimestamp - systimestamp) year to month from dual;

Interval expressions are a handy way to get intervals (especially year-month intervals) from all datetime datatypes with the same code.

INTERVAL YEAR TO MONTH

INTERVAL YEAR TO MONTH is based on calendar years and months, so the “same” interval may cover a different number of days depending on the context. For example, adding INTERVAL '1' MONTH  to January 1 will add 31 days, but adding it to February 1, 2019 will add 28 days. In fact, sometimes it won’t add any days at all! Tell Oracle to add 1 month to January 31 and it will just say February 31 doesn’t exist. (Actually the error message is not quite that specific…).

The other surprise is rounding: since both year and month portions are integers, the month portion is rounded according to a rule I don’t fully understand:

  • (DATE '2016-02-29' - DATE '2016-02-15') year to month is 0 months, whereas
  • (DATE '2016-03-01' - DATE '2016-02-16') year to month is 1 month.

Keeping all this in mind, we can still use intervals and interval expressions to calculate the difference between two datetime values in calendar years and months, days, hours, minutes and seconds.

The answer, step by step

To split things up into smaller portions that are easier to understand, I’ll start with a view called QUESTION. We can change this view whenever we want different test data. Here I’ll test all possible date-only ranges (all time portions are 00:00:00) in the years 2015 through 2017.

create or replace view question as  
with dates(dte) as (  
  select date '2014-12-31' + level  
  from dual  
  connect by level <= 365+366+365  
)  
, ranges(date_from, date_to) as (  
  select a.dte, b.dte  
  from dates a, dates b  
  where a.dte < b.dte  
)  
select * from ranges;

My first draft of an answer is straightforward:

  • The year-month interval goes from the first day of the “from” month to the first day of the “to” month.
  • The day-second interval =
    • the day-second interval from date_from to date_to
    • minus the day-second interval from the first day of the “from” month to the first day of the “to” month.
create or replace view answer as  
select date_from, date_to,  
  (trunc(date_to,'mm') - trunc(date_from,'mm')) year(5) to month as iym,  
  (date_to - date_from) day(7) to second -  
    (trunc(date_to,'mm') - trunc(date_from,'mm')) day(7) to second  
  as ids  
from question;

This solution lets us do “round trips”: if we have date_from, iym and ids, we can calculate date_to, and if we have date_to, iym and ids we can calculate date_from.

select a.*,  
  case when ids < interval '0' second  
    then date_from + ids + iym  
    else date_from + iym + ids  
  end new_to,  
  case when ids < interval '0' second  
    then date_to - iym - ids  
    else date_to - ids - iym  
  end new_from  
from answer a;

The only problem is that IDS, which is the day-second interval, is negative almost half the time. We can use that fact to calculate properly, but users will find it strange.

In order to satisfy the users, here is a second draft:

  • When IDS is negative
    • subtract 1 from IYM
    • add to IDS the number of days in the “from” month

If we want to avoid ambiguous results, we still have to say when the original IDS was negative.
To simplify “round trips”, I decided to add an IDS_ADJUSTMENT column with the number of days in the “from” month.

create or replace view adjusted_answer as  
with details as (  
  select date_from, date_to,  
    (trunc(date_to,'mm') - trunc(date_from,'mm')) year(5) to month  
    as iym,  
    (date_to - date_from) day(7) to second -  
      (trunc(date_to,'mm') - trunc(date_from,'mm')) day(7) to second  
    as ids,  
    ((trunc(date_from,'mm') + interval '1' month) - trunc(date_from,'mm')) day to second  
    as from_month_days  
  from question  
)  
select date_from, date_to,  
  case when ids < interval '0' second  
    then iym - interval '1' month -- subtract 1 month  
    else iym  
  end iym,  
  case when ids < interval '0' second  
    then ids + from_month_days --add number of days in "from" month  
    else ids  
  end ids,  
  case when ids < interval '0' second  
    then from_month_days  
    else interval '0' second  
  end ids_adjustment    
from details;
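To see why the adjustment works, here is the same logic sketched in Python for date-only values (my own illustration, not part of the SQL solution):

```python
import calendar
from datetime import date

def date_diff(date_from, date_to):
    """Return (months, days). The month part counts months between the two
    month starts; when the day part is negative, subtract 1 month and add
    the number of days in the "from" month."""
    iym = (date_to.year - date_from.year) * 12 + (date_to.month - date_from.month)
    ids = date_to.day - date_from.day      # day-second part, date-only case
    if ids < 0:
        iym -= 1                           # subtract 1 month
        ids += calendar.monthrange(date_from.year, date_from.month)[1]
    return iym, ids
```

For example, from 2019-01-31 to 2019-03-01 the raw difference is 2 months and -30 days; after the adjustment it becomes 1 month and 1 day (January has 31 days).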

This gives plausible results to the users. The same DATE_FROM, IYM and IDS combination will sometimes occur for two different DATE_TO values, but the IDS_ADJUSTMENT column will allow us to do “round trips”.

select a.*,  
  case when ids_adjustment > interval '0' second  
    then date_from  
      + (ids - ids_adjustment)  
      + (iym + interval '1' month)  
    else date_from + iym + ids  
  end new_to,  
  case when ids_adjustment > interval '0' second  
    then date_to  
      - (iym + interval '1' month)  
      - (ids - ids_adjustment)  
    else date_to - ids - iym  
  end new_from      
from adjusted_answer a;

Again, almost half the ranges will have a positive IDS_ADJUSTMENT value. We can eliminate that when the “from” month and the month preceding date_to have the same number of days: compare the FROM_MONTH_DAYS and BEFORE_TO_MONTH_DAYS columns in the CASE expression below.

create or replace view final_answer as  
with details as (  
  select date_from, date_to,  
    (trunc(date_to,'mm') - trunc(date_from,'mm')) year(5) to month  
    as iym,  
    (date_to - date_from) day(7) to second -  
      (trunc(date_to,'mm') - trunc(date_from,'mm')) day(7) to second  
    as ids,  
    ((trunc(date_from,'mm') + interval '1' month) - trunc(date_from,'mm')) day to second  
    as from_month_days,  
    (trunc(date_to,'mm') - (trunc(date_to,'mm') - interval '1' month)) day to second  
    as before_to_month_days  
  from question  
)  
select date_from, date_to,  
  case when ids < interval '0' second  
    then iym - interval '1' month -- subtract 1 month  
    else iym  
  end iym,  
  case when ids < interval '0' second  
    then ids + from_month_days --add number of days in "from" month  
    else ids  
  end ids,  
  case when ids < interval '0' second and from_month_days != before_to_month_days  
    then from_month_days  
    else interval '0' second  
  end ids_adjustment    
from details;

Now IDS_ADJUSTMENT will only be positive 25% of the time (with this test data!). We can do “round trips” the same way as with the “adjusted answer”.

As a bonus, thanks to “interval expressions” we can change the data types in the QUESTION view and the solution will still work with no changes to the code.

Conclusion

We can calculate a sensible difference (including years and months) between two dates, but unless we provide additional information we cannot calculate one date using the other date and the difference.

Elegant MODEL clause

Frank Kulash recently asked a challenging question on OTN, which Chris227 answered using the MODEL clause. To help fully understand his excellent answer, I’m going to go into detail. (To concentrate on the modeling concepts, I’ve changed a few details in the problem and the solution.)

Teams Drafting Players

There are five teams that take turns choosing among eight players.

TEAM_NAME  PLAYER_NAME
Atletico   Mane
Burnley    Neymar
Crystal    Ronaldo
Dortmund   Silva
Everton    Weah
           Xavi
           Yorke
           Zidane

 

The teams choose in alphabetical order. Once a player is picked, he is of course unavailable to be picked by another team. In fact, a team does not pick directly: it gives a rating to each player that it wants, and the available player with the highest rating is automatically assigned to the team. To simplify the problem, each team picks only once.

Here are the ratings, ordered by TEAM_NAME, RATING DESC, with the picks marked by an asterisk.

TEAM_NAME PLAYER_NAME RATING
Atletico Mane 90 *
Atletico Neymar 80
Atletico Ronaldo 60
Burnley Silva 91 *
Burnley Weah 71
Burnley Zidane 51
Crystal Silva 82
Crystal Weah 72 *
Crystal Mane 62
Dortmund Silva 93
Dortmund Weah 83
Dortmund Mane 73
Everton Weah 94
Everton Silva 84
Everton Yorke 74 *
  • Atletico and Burnley get their top-rated players (Mane and Silva).
  • Crystal gets its second rated (Weah) because its first rated (Silva) is taken.
  • Dortmund has only rated three players and they are all taken, so Dortmund gets no one.
  • Everton gets its third rated player (Yorke) because the first two are taken.
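Before turning to MODEL, it may help to see the drafting logic procedurally. Here is a small Python sketch of the same rules (my own illustration):

```python
# Ratings from the CHOICE table, as (player, rating) pairs per team
ratings = {
    'Atletico': [('Mane', 90), ('Neymar', 80), ('Ronaldo', 60)],
    'Burnley':  [('Silva', 91), ('Weah', 71), ('Zidane', 51)],
    'Crystal':  [('Silva', 82), ('Weah', 72), ('Mane', 62)],
    'Dortmund': [('Silva', 93), ('Weah', 83), ('Mane', 73)],
    'Everton':  [('Weah', 94), ('Silva', 84), ('Yorke', 74)],
}

picked = {}
taken = set()
for team in sorted(ratings):                   # teams choose in alphabetical order
    for player, rating in sorted(ratings[team], key=lambda p: -p[1]):
        if player not in taken:                # highest-rated available player wins
            picked[team] = player
            taken.add(player)
            break                              # each team picks only once
```

Running this gives Mane to Atletico, Silva to Burnley, Weah to Crystal, Yorke to Everton, and nothing to Dortmund, exactly as described above.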

For input, here is Frank’s table with the ratings for each team. Notice that (team_name, player_name) is unique, and (team_name, rating) is unique.

CREATE TABLE    choice (team_name, player_name, rating)    AS
SELECT  'Atletico', 'Mane',    90  FROM dual  UNION ALL
SELECT  'Atletico', 'Neymar',  80  FROM dual  UNION ALL      -- combination of
SELECT  'Atletico', 'Ronaldo', 60  FROM dual  UNION ALL      -- (team_name, player_name)
SELECT  'Burnley',  'Silva',   91  FROM dual  UNION ALL      -- is unique; so is
SELECT  'Burnley',  'Weah',    71  FROM dual  UNION ALL      -- (team_name, rating)
SELECT  'Burnley',  'Zidane',  51  FROM dual  UNION ALL
SELECT  'Crystal',  'Silva',   82  FROM dual  UNION ALL
SELECT  'Crystal',  'Weah',    72  FROM dual  UNION ALL
SELECT  'Crystal',  'Mane',    62  FROM dual  UNION ALL
SELECT  'Dortmund', 'Silva',   93  FROM dual  UNION ALL
SELECT  'Dortmund', 'Weah',    83  FROM dual  UNION ALL
SELECT  'Dortmund', 'Mane',    73  FROM dual  UNION ALL
SELECT  'Everton',  'Weah',    94  FROM dual  UNION ALL
SELECT  'Everton',  'Silva',   84  FROM dual  UNION ALL
SELECT  'Everton',  'Yorke',   74  FROM dual;

Starting to MODEL

select * from choice
model
dimension by ( team_name, player_name)
measures ( rating, 0 picked )
rules (
)
order by team_name, rating desc;

The MODEL clause turns the input data into a multi-dimensional array.

  • The DIMENSION clause lists the array indexes: here they are strings, not integers. The dimension columns identify each row uniquely, and their values cannot be changed.
  • The MEASURES clause lists columns whose values can be changed.
    A measure column in an individual row is called a cell: for example, RATING['Atletico', 'Mane']
  • The RULES clause lists rules that say what changes should be made. We’ll see rules in a bit.

Here we see what the data looks like before any rules are applied.

TEAM_NAME PLAYER_NAME RATING PICKED
Atletico Mane 90 0
Atletico Neymar 80 0
Atletico Ronaldo 60 0
Burnley Silva 91 0
Burnley Weah 71 0
Burnley Zidane 51 0
Crystal Silva 82 0
Crystal Weah 72 0
Crystal Mane 62 0
Dortmund Silva 93 0
Dortmund Weah 83 0
Dortmund Mane 73 0
Everton Weah 94 0
Everton Silva 84 0
Everton Yorke 74 0

 

What is a RULE?

A rule assigns a value to a cell. For example

PICKED['Atletico', 'Mane'] = 1

The left-hand side is the cell, and the right-hand side is the value assigned to that cell.

The left-hand side can refer to more than one cell:

  • PICKED['Atletico', ANY] = 1 will assign 1 to the PICKED cells for team ‘Atletico’, no matter who the player is;
  • PICKED[ANY, 'Mane'] = 1 will assign 1 to the PICKED cells for player ‘Mane’, no matter what the team is;
  • PICKED[ANY, ANY] = 1 will assign 1 to each and every PICKED cell.

The right-hand side contains a value, but that value can come from a cell. To use the same dimension values on the left-hand side and the right-hand side, we can use CV().

  • PICKED[ANY, ANY] = RATING[CV(),CV()] will assign to each PICKED cell the value RATING from the same row.

Each rule applies to only one cell at a time. If the left-hand side refers to several cells, the right-hand side is evaluated and assigned separately to each left-hand cell. The result of the evaluation must be one value, not a range of values.

Aggregations in Rules

On the right-hand side of a rule, we can aggregate over a range of cells to get one value.

  • PICKED[ANY, ANY] = max(RATING)[CV(),ANY] will assign to each PICKED cell the maximum value of RATING for that team. Notice the parentheses end before the dimension columns.

Ordering in Rules

Since each rule is applied to only one left-side cell at a time, the order of execution can be important. Look what happens with this rule:

picked[any,any] order by team_name, rating desc = max(picked)[any,any] + 1

TEAM_NAME PLAYER_NAME RATING PICKED
Atletico Mane 90 1
Atletico Neymar 80 2
Atletico Ronaldo 60 3
Burnley Silva 91 4
Burnley Weah 71 5
Burnley Zidane 51 6
Crystal Silva 82 7
Crystal Weah 72 8
Crystal Mane 62 9
Dortmund Silva 93 10
Dortmund Weah 83 11
Dortmund Mane 73 12
Everton Weah 94 13
Everton Silva 84 14
Everton Yorke 74 15

 

Do you see what happened here? For every left-side cell, the right-side expression is evaluated using the previously assigned values! That is why the order is so important.

The Solution to our problem

If we order the rows by team_name, rating desc then we can say for each row:

  • If the current row’s team has not picked
  • and the current row’s player has not been picked,
  • then pick that player for that team.
select * from choice
model
dimension by ( team_name, player_name)
measures ( rating, 0 picked )
rules (
  picked[any,any] order by team_name, rating desc = 
    case
      when max(picked)[any,cv()] = 0 and max(picked)[cv(),any] = 0
      then 1
      else 0
    end    
);
TEAM_NAME PLAYER_NAME RATING PICKED
Atletico Mane 90 1
Atletico Neymar 80 0
Atletico Ronaldo 60 0
Burnley Silva 91 1
Burnley Weah 71 0
Burnley Zidane 51 0
Crystal Silva 82 0
Crystal Weah 72 1
Crystal Mane 62 0
Dortmund Silva 93 0
Dortmund Weah 83 0
Dortmund Mane 73 0
Everton Weah 94 0
Everton Silva 84 0
Everton Yorke 74 1

 

Multiple Rounds

To let the teams pick more than one player, we can just add the ITERATE keyword. The ITERATION_NUMBER function starts with 0.

select * from choice
model
dimension by ( team_name, player_name)
measures ( rating, 0 picked )
rules iterate (3) (
  picked[any,any] order by team_name, rating desc = 
    case
      when max(picked)[any,cv()] = 0 and max(picked)[cv(),any] <= iteration_number
      then iteration_number + 1
      else picked[cv(),cv()]
    end    
);
TEAM_NAME PLAYER_NAME RATING PICKED
Atletico Mane 90 1
Atletico Neymar 80 2
Atletico Ronaldo 60 3
Burnley Silva 91 1
Burnley Weah 71 0
Burnley Zidane 51 2
Crystal Silva 82 0
Crystal Weah 72 1
Crystal Mane 62 0
Dortmund Silva 93 0
Dortmund Weah 83 0
Dortmund Mane 73 0
Everton Weah 94 0
Everton Silva 84 0
Everton Yorke 74 1

 

Did you make it this far?

If so, thanks for your patience! I thought this was a great example of the power of MODEL, because it uses ordering, aggregation and previously changed values in its evaluations – all in one pretty concise rule.

Hope this helps…

Missing cents when dividing an amount

Alex Nuijten (@alexnuijten) wrote a blog post on this subject a few years ago. Since his blog doesn’t accept comments at the moment, here is a variant I think is interesting.

The problem

As an example, say we have $100 that we want to divide into 42 parts.

  • ROUND(100/42,2) gives us $2.38
  • Multiplying $2.38 by 42 gives us $99.96
  • So to get a total of 100, we need to add one cent to four lines.

Adding cents to the first four lines

Here is Alex’s solution:

with amounts
as
( select level id
      ,100 amount
      ,count(*) over () entries
      ,round (100 / count(*) over (), 2) rounded
    from dual
    connect by level <= 42
)
,indicators as 
(
select id
      ,amount
      ,entries
      ,rounded
      ,case 
       when rownum <= abs ((amount - (entries * rounded)) * 100)
       then 'x'
       end as indicator
      ,sign (amount - (entries * rounded)) as pos_neg
 from amounts
)
select id
      ,rounded +
       (nvl2 (indicator, 0.01, 0) * pos_neg) final_amount
  from indicators
;
ID FINAL_AMOUNT
1 2.39
2 2.39
3 2.39
4 2.39
5 2.38
6 2.38
7 2.38
8 2.38
9 2.38

 

Adding cents using a running total

Personally, I would prefer to spread the extra cents more evenly across the lines. To do that, I simply round the current running total, then subtract the previous line’s rounded running total. This turns out to be less code.

with data(id, amount, entries) as (
  select level, 100, count(*) over()
  from dual connect by level <= 42
)
select id, amount, entries,
  round(id*amount/entries, 2) - round((id-1)*amount/entries, 2) split
from data;

The lines with an extra cent are 6, 16, 27 and 37.
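The running-total idea is not SQL-specific; here is the same computation sketched in Python:

```python
def split_amount(amount, entries):
    """Round each running total to cents, then subtract the previous
    rounded running total: the extra cents spread themselves out."""
    return [round(i * amount / entries, 2) - round((i - 1) * amount / entries, 2)
            for i in range(1, entries + 1)]
```

With `split_amount(100, 42)`, the splits always sum back to the original amount, and lines 6, 16, 27 and 37 get the extra cent. It also works with a negative amount.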

Hope this helps. Regards, Stew

UPDATE 2019-10-12: in the comments, mathguy proposed an alternative saying “no analytic functions and no subqueries needed”. In fact, if the amount and the number of entries are given as inputs, no analytic function is needed, and the WITH clause is simply a matter of style. Also, paulzipblog mentioned that the amount could be negative. My solution works with negative amounts and, given the right inputs, requires neither analytic functions nor subqueries.

with inputs(amount, entries) as (select -100, 42 from dual)
select level as id, amount, entries,
  round(level*amount/entries, 2) - round((level-1)*amount/entries, 2) split
from inputs
connect by level <= entries;

My #OUGN19 Presentations

As promised, here are the two presentations I gave at the OUGN 2019 Boat conference. I had a great time with old friends and new, and managed to have discussions about politics without anyone getting upset! Hope to see some of you again at OUG Ireland!

Please download the presentations to see the animations.

https://www.slideshare.net/stewashton/json-in-18c-and-19c

https://www.slideshare.net/stewashton/make-your-data-dance

Splitting strings before they’re CLOBs!

After I tokenized two CLOBs in response to a forum question, Andy Sayer mentioned that they were probably files first. So I smacked my forehead and used external tables. When the light bulb went on, I saw that the comma needed to be a record delimiter, not a field separator. Easy.

create table item_ids_EXT(  
  "RN" NUMBER,  
  "ITEM_ID" VARCHAR2(16)  
)  
ORGANIZATION EXTERNAL(  
  TYPE ORACLE_LOADER  
  DEFAULT DIRECTORY EXT_DATA_DIR  
  ACCESS PARAMETERS(  
    records delimited by ','   
    BADFILE EXT_DATA_DIR:'i_item_id.txt.bad'  
    LOGFILE EXT_DATA_DIR:'i_item_id.txt.log'  
    FIELDS TERMINATED BY "§" NOTRIM  
    MISSING FIELD VALUES ARE NULL  
    REJECT ROWS WITH ALL NULL FIELDS  
    (  
      "RN" recnum,  
      "ITEM_ID" char(255)  
    )  
  )  
  location ('i_item_id.txt')  
)  
REJECT LIMIT 10;

create table item_keys_EXT(  
  "RN" NUMBER,  
  "ITEM_KEY" VARCHAR2(255)  
)  
ORGANIZATION EXTERNAL(  
  TYPE ORACLE_LOADER  
  DEFAULT DIRECTORY EXT_DATA_DIR  
  ACCESS PARAMETERS(  
    records delimited by ','   
    BADFILE EXT_DATA_DIR:'i_item_key.txt.bad'  
    LOGFILE EXT_DATA_DIR:'i_item_key.txt.log'  
    FIELDS TERMINATED BY "§" NOTRIM  
    MISSING FIELD VALUES ARE NULL  
    REJECT ROWS WITH ALL NULL FIELDS  
    (  
      "RN" recnum,  
      "ITEM_KEY" char(255)  
    )  
  )  
  location ('i_item_key.txt')  
)  
REJECT LIMIT 10;  

set timing on

create table u as  
select item_id, item_key  
from item_ids_ext  
join item_keys_ext using(rn);

Table U created.

Elapsed: 00:00:00.787

select floor(2 * 60 * 60 / 0.787) times_faster from dual;

TIMES_FASTER
------------
        9148

Splitting strings when they are CLOBs

Like many of you, I have enjoyed using SQL to make data processing go hundreds of times faster. My new record is: 8000 times faster :-)

In a recent ODC thread, the question was how to split out comma separated values from two CLOBs, one containing “item ids” and the other “item keys”. The ids and the keys had to be matched by position: the first id with the first key, the second id with the second key, and so on. Then each id/key pair was to be inserted as a row into a target table.

With 50,000 ids of 7 characters and 50,000 keys of 172 characters, the current processing takes 2 hours.

Test data

First I’ll create a table with the two CLOBs.

--drop table t purge;
create table t as
select 
SUBSTR(
  xmlcast(
    xmlagg(
      xmlelement(E, ',' || to_char(level,'fm0000009'))
      order by level
    ) AS CLOB
  ), 2
) i_item_id,
SUBSTR(
  xmlcast(
    xmlagg(
      xmlelement(E, ',' || rpad(to_char(level,'fm0000009'),172,'*'))
      order by level
    ) AS CLOB
  ), 2
) i_item_key
from dual
connect by level <= 50000;

Splitting CLOB strings

Instead of seeking a “perfect” custom-made solution, I just put together two functions I have already written about:

  1. The PIPE_CLOB function: a pipelined table function that reads a CLOB and outputs up to 4000 bytes at a time, respecting a delimiter.
  2. The STRING_TOKENIZE function: a pipelined table function that splits a VARCHAR2 string according to a delimiter.

All I have to do is feed the output from the first function into the second, and voilà!

select b.column_value item_id
from t,
table(pipe_clob( i_item_id, 4000, ',')) a,
table(string_tokenize(a.column_value,',')) b;
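The division of labor between the two functions can be sketched in Python (hypothetical stand-ins; the real PIPE_CLOB and STRING_TOKENIZE are PL/SQL pipelined table functions):

```python
def pipe_clob(clob, limit, delim):
    """Yield pieces of at most `limit` characters, each cut just after a delimiter."""
    start = 0
    while start < len(clob):
        end = min(start + limit, len(clob))
        if end < len(clob):
            cut = clob.rfind(delim, start, end)
            if cut < start:
                raise ValueError('token longer than limit')
            end = cut + 1                  # keep the trailing delimiter in the piece
        yield clob[start:end]
        start = end

def string_tokenize(piece, delim):
    """Split one piece; drop the empty token left by a trailing delimiter."""
    return [t for t in piece.split(delim) if t]

clob = 'a,bb,ccc,dddd,e'
tokens = [t for piece in pipe_clob(clob, 5, ',')
            for t in string_tokenize(piece, ',')]
# tokens == ['a', 'bb', 'ccc', 'dddd', 'e']
```

Because each piece ends on a delimiter boundary, no token is ever split across two pieces, so the tokens come out in order just as if the whole CLOB had been split at once.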

Combining the two CLOBs

Remember, the requirement was to combine the values from both CLOBs by position, in order to insert them into a target table. Since the functions return the values in order, I just use ROWNUM to number the values and use that number to join them.

--drop table u purge;
create table u as
with item_ids as (
  select rownum rn,
    b.column_value item_id
    from t,
  table(pipe_clob( i_item_id, 4000, ',')) a,
  table(string_tokenize(a.column_value,',')) b
)
, item_keys as (
  select rownum rn,
    b.column_value item_key
    from t,
  table(pipe_clob( i_item_key, 4000, ',')) a,
  table(string_tokenize(a.column_value,',')) b
)
select item_id, item_key 
from item_ids join item_keys using(rn);

Table U created.

Elapsed: 00:00:00.879

That’s right, processing time went from two hours to less than one second.

Now, I suspect that the questioner’s code is doing more than just a simple insert, but until he or she provides more details about the requirement, I get to maintain my claim:

8000 times faster!

(which I hope applies only to my code, not to the speed at which time flies…)