Making Longer Lists

For very long lists, we need the return values to be CLOBs. Alas, LISTAGG can only return VARCHAR2 lists, but XML and JSON aggregate functions can return CLOBs!

As an exercise, I’ll start from ALL_OBJECTS: I want to make a list of all fully qualified object names for each object type. To start with, a simple subquery:

with data as (
  select object_type, owner ||'.'|| object_name obj
  from all_objects
)
select * from data;

OBJECT_TYPE  OBJ
INDEX	     SYS.I_FILE#_BLOCK#
INDEX	     SYS.I_OBJ3
INDEX	     SYS.I_TS1
INDEX	     SYS.I_CON1
...

XML

XML functions work on the XMLTYPE data type, which has practically no length limitations. To make my list, I have to go through three steps:

  1. Make an XML fragment out of each object.
    This step will escape any special characters such as '<' or '>'.
  2. Aggregate the XML fragments.
  3. Extract the text from the aggregated XML data.
    This step will unescape the special characters.

I have to handle myself the comma that separates the elements of the list.

with data as (
  select object_type, owner ||'.'|| object_name obj
  from all_objects
)
select object_type,
SUBSTR(
  xmlcast(
    xmlagg(
      xmlelement(O, ',' || obj)
      order by obj
    ) AS CLOB
  ), 2
) qualified_objects
from data
group by object_type
order by object_type;

On my system with about 78000 objects, this takes 3 seconds and a bit.

JSON

JSON support in the Oracle database is pretty new. For full support of CLOBs, we need versions 18c or later (or a patched version 12.2, but I don’t know the precise version number).

The steps are a little different with JSON:

  1. I can aggregate the data directly into a JSON array.
    This escapes any special characters like double quotes, newlines, etc.
  2. To turn the multiple array items into one (which is my list), I replace "," (which separates the array items) with my separator.
  3. I then use JSON_VALUE to extract the list from the array.
    This unescapes any character that needs it.
with data as (
  select object_type, owner ||'.'|| object_name obj
  from all_objects
)
select object_type,
  json_value(
    replace(
      json_arrayagg(obj order by obj returning clob),
      '","',
      ','
    ),
    '$[0]' returning clob
  ) qualified_objects
from data
group by object_type
order by object_type;

Instead of 3+ seconds, this code takes a bit over 1 second, so three times faster.

As a reminder, JSON is great for going the other way: split the items of a list into a table. See Splitting Strings: a New Champion!

Advertisements

Interval expressions

I just learned about these critters thanks to an ODC forum question: how to calculate the difference between two dates in calendar years and months, days, hours, minutes and seconds.

What are they, and why are they?

Take any two datetime values (DATE, TIMESTAMP, TIMESTAMP WITH (LOCAL) TIME ZONE) and subtract one from the other: what do you get?

  • With two dates, you get the number of days.
  • With two timestamps, you get an INTERVAL DAY TO SECOND.

If you always want the interval, just put the two values in parentheses and add DAY TO SECOND. You can add YEAR TO MONTH instead to get an INTERVAL YEAR TO MONTH.

All the following combinations are OK:

select (sysdate - sysdate) day to second from dual;
select (sysdate - sysdate) year to month from dual;
select (sysdate - systimestamp) day to second from dual;
select (sysdate - systimestamp) year to month from dual;
select (systimestamp - sysdate) day to second from dual;
select (systimestamp - sysdate) year to month from dual;
select (systimestamp - systimestamp) day to second from dual;
select (systimestamp - systimestamp) year to month from dual;

Interval expressions are a handy way to get intervals (especially year-month intervals) from all datetime datatypes with the same code.

INTERVAL YEAR TO MONTH

INTERVAL YEAR TO MONTH is based on calendar years and months, so the “same” interval may cover a different number of days depending on the context. For example, adding INTERVAL '1' MONTH  to January 1 will add 31 days, but adding it to February 1, 2019 will add 28 days. In fact, sometimes it won’t add any days at all! Tell Oracle to add 1 month to January 31 and it will just say February 31 doesn’t exist. (Actually the error message is not quite that specific…).

The other surprise is rounding: since both year and month portions are integers, the month portion is rounded according to a rule I don’t fully understand:

  • (DATE '2016-02-29' - DATE '2016-02-15') year to month is 0 months, whereas
  • (DATE '2016-03-01' - DATE '2016-02-16') year to month is 1 month.

Keeping all this in mind, we can still use intervals and interval expressions to calculate the difference between two datetime values in calendar years and months, days, hours, minutes and seconds.

The answer, step by step

To split things up into smaller portions that are easier to understand, I’ll start with a view called QUESTION. We can change this view whenever we want different test data. Here I’ll test all possible date-only ranges (all time portions are 00:00:00) in the years 2015 through 2017.

create or replace view question as  
with dates(dte) as (  
  select date '2014-12-31' + level  
  from dual  
  connect by level <= 365+366+365  
)  
, ranges(date_from, date_to) as (  
  select a.dte, b.dte  
  from dates a, dates b  
  where a.dte < b.dte  
)  
select * from ranges;

My first draft of an answer is straightforward:

  • The year-month interval goes from the first day of the “from” month to the first day of the “to” month.
  • The day-second interval =
    • the day-second interval from date_from to date_to
    • minus the day-second interval from the first day of the “from” month to the first day of the “to” month.
create or replace view answer as  
select date_from, date_to,  
  (trunc(date_to,'mm') - trunc(date_from,'mm')) year(5) to month as iym,  
  (date_to - date_from) day(7) to second -  
    (trunc(date_to,'mm') - trunc(date_from,'mm')) day(7) to second  
  as ids  
from question;

This solution lets us do “round trips”: if we have date_from, iym and ids, we can calculate date_to, and if we have date_to, iym and ids we can calculate date_from.

select a.*,  
  case when ids < interval '0' second  
    then date_from + ids + iym  
    else date_from + iym + ids  
  end new_to,  
  case when ids < interval '0' second  
    then date_to - iym - ids  
    else date_to - ids - iym  
  end new_from  
from answer a;

The only problem is that IDS, which is the day-second interval, is negative almost half the time. We can use that fact to calculate properly, but users will find it strange.

In order to satisfy the users, here is a second draft:

  • When IDS is negative
    • subtract 1 from IYM
    • add to IDS the number of days in the “from” month

If we want to avoid ambiguous results, we still have to say when the original IDS was negative.
To simplify “round trips”, I decided to add an IDS_ADJUSTMENT column with the number of days in the “from” month.

create or replace view adjusted_answer as  
with details as (  
  select date_from, date_to,  
    (trunc(date_to,'mm') - trunc(date_from,'mm')) year(5) to month  
    as iym,  
    (date_to - date_from) day(7) to second -  
      (trunc(date_to,'mm') - trunc(date_from,'mm')) day(7) to second  
    as ids,  
    ((trunc(date_from,'mm') + interval '1' month) - trunc(date_from,'mm')) day to second  
    as from_month_days  
  from question  
)  
select date_from, date_to,  
  case when ids < interval '0' second  
    then iym - interval '1' month -- subtract 1 month  
    else iym  
  end iym,  
  case when ids < interval '0' second  
    then ids + from_month_days --add number of days in "from" month  
    else ids  
  end ids,  
  case when ids < interval '0' second  
    then from_month_days  
    else interval '0' second  
  end ids_adjustment    
from details;

This gives plausible results to the users. The same DATE_FROM, IYM and IDS combination will sometimes occur for two different DATE_TO values, but the IDS_ADJUSTMENT column will allow us to do “round trips”.

select a.*,  
  case when ids_adjustment > interval '0' second  
    then date_from  
      + (ids - ids_adjustment)  
      + (iym + interval '1' month)  
    else date_from + iym + ids  
  end new_to,  
  case when ids_adjustment > interval '0' second  
    then date_to  
      - (iym + interval '1' month)  
      - (ids - ids_adjustment)  
    else date_to - ids - iym  
  end new_from      
from adjusted_answer a;

Again, almost half the ranges will have a positive IDS_ADJUSTMENT value. We can eliminate that when the “from” month and the month preceding date_to have the same number of days. (See lines 11,12 and 24.)

create or replace view final_answer as  
with details as (  
  select date_from, date_to,  
    (trunc(date_to,'mm') - trunc(date_from,'mm')) year(5) to month  
    as iym,  
    (date_to - date_from) day(7) to second -  
      (trunc(date_to,'mm') - trunc(date_from,'mm')) day(7) to second  
    as ids,  
    ((trunc(date_from,'mm') + interval '1' month) - trunc(date_from,'mm')) day to second  
    as from_month_days,  
    (trunc(date_to,'mm') - (trunc(date_to,'mm') - interval '1' month)) day to second  
    as before_to_month_days  
  from question  
)  
select date_from, date_to,  
  case when ids < interval '0' second  
    then iym - interval '1' month -- subtract 1 month  
    else iym  
  end iym,  
  case when ids < interval '0' second  
    then ids + from_month_days --add number of days in "from" month  
    else ids  
  end ids,  
  case when ids < interval '0' second and from_month_days != before_to_month_days  
    then from_month_days  
    else interval '0' second  
  end ids_adjustment    
from details;

Now IDS_ADJUSTMENT will only be positive 25% of the time (with this test data!) We can do “round trips” the same way as with the “adjusted answer”.

As a bonus, thanks to “interval expressions” we can change the data types in the QUESTION view and the solution will still work with no changes to the code.

Conclusion

We can calculate a sensible difference (including years and months) between two dates, but unless we provide additional information we cannot calculate one date using the other date and the difference.

Making Lists

Strings that contain multiple values violate first normal form, which mandates “one value per column per row”.  People still ask how to make or massage lists of values. This post is about aggregating values into lists.

Rewrite 2019-07-11: I am now using DNAME instead of DEPTNO so that the JOIN makes sense. Also Tony Hasler pointed out that JOIN before MODEL does not need an inline view.

(To split lists into little tables, see my series on “Splitting strings”.)

Chunks of lists

(For a discussion of this topic that has influenced my solutions, see Managing overflows in LISTAGG by Keith Laker, with solutions provided by Lei Sheng.)

Version 11.2 introduced LISTAGG, which concatenates values with or without a delimiter. For example:

select dname,
  listagg(ename, ',') within group(order by ename) enames
from dept join emp using(deptno)
group by dname
order by dname;
DNAME ENAMES
ACCOUNTING CLARK,KING,MILLER
RESEARCH ADAMS,FORD,JONES,SCOTT,SMITH
SALES ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD

 
Unfortunately, the return value cannot be a CLOB; if it is too long, we get an ORA-01489 exception. Version 12.2 provides the ON OVERFLOW clause that truncates the return value, so no error but we don’t get all the values back.

Even before version 12.2, we could get all the values if we divided the list into chunks with a valid maximum length.

Starting with 12.1 and using the MATCH_RECOGNIZE clause:

select dname, chunk,
  listagg(ename, ',') within group(order by ename) enames
from (
  select dname, ename from dept join emp using(deptno)
)
match_recognize(
  partition by dname order by ename
  measures match_number() chunk
  all rows per match
  pattern(s b*)
  define b as sum(lengthb(ename)) + count(b.*) <= 20
)
group by dname, chunk
order by dname, chunk;
DNAME CHUNK ENAMES
ACCOUNTING 1 CLARK,KING,MILLER
RESEARCH 1 ADAMS,FORD,JONES
RESEARCH 2 SCOTT,SMITH
SALES 1 ALLEN,BLAKE,JAMES
SALES 2 MARTIN,TURNER,WARD

 

Notice I am using LENGTHB to calculate the length in bytes of the input strings. COUNT(B.*) adds the lengths of the delimiters, assuming that the delimiter is one byte long. Lei uses LENGTHB even on the delimiter.

Before 12.1 we can get there with the MODEL clause:

select dname, chunk,
  listagg(ename, ',') within group(order by ename) enames
from (
  select * from dept join emp using(deptno)
  model
    partition by (dname)
    dimension by (row_number() over(partition by dname order by ename) rn)
    measures(ename, lengthb(ename) length_so_far, 1 chunk)
    rules(
      length_so_far[rn>1] = lengthb(ename[cv()]) +
        case
          when lengthb(ename[cv()]) + length_so_far[cv()-1] + 1 <= 20
          then length_so_far[cv()-1] + 1
          else 0
        end,
      chunk[rn>1] = chunk[cv()-1] +
        case
          when length_so_far[cv()] = lengthb(ename[cv()])
          then 1
          else 0
        end    
    )
)
group by dname, chunk
order by dname, chunk;
  • Lines 10-15 : the first RULE calculates the running length of the list that would be made, starting over once the list gets past the limit of 20 bytes.
  • Lines 16-21 : the second RULE increases the CHUNK number whenever the running length starts over.
  • Finally, LISTAGG groups by DNAME and CHUNK.

But I want just one long string!

In that case, you need a CLOB. For that, you don’t even need version 11.2, just version 11.1.

select dname,
SUBSTR(
  xmlcast(
    xmlagg(
      xmlelement(E, ',' || ename)
      order by ename
    ) AS CLOB  -- AS VARCHAR2(4000) would be the equivalent of LISTAGG
  ), 2
) enames
from dept join emp using(deptno)
group by dname
order by dname;

Elegant MODEL clause

Frank Kulash recently asked a challenging question on OTN, which Chris227 answered using the MODEL clause. To help  fully understand his excellent answer, I’m going to go into detail. (To concentrate on the modeling concepts, I’ve changed a few details in the problem and the solution.)

Teams Drafting Players

There are five teams that take turns choosing among eight players.

TEAM_NAME PLAYER_NAME
Atletico Mane
Burnley Neymar
Crystal Ronaldo
Dortmund Silva
Everton Weah
Xavi
Yorke
Zidane

 

The teams choose in alphabetical order. Once a player is picked, he is of course unavailable to be picked by another team. In fact, a team does not pick directly: it gives a rating to each player that it wants, and the available player with the highest rating is automatically assigned to the team. To simplify the problem, each team picks only once.

Here are the ratings, ordered by TEAM_NAME, RATING DESC, with the picks in bold.

TEAM_NAME PLAYER_NAME RATING
Atletico Mane 90
Atletico Neymar 80
Atletico Ronaldo 60
Burnley Silva 91
Burnley Weah 71
Burnley Zidane 51
Crystal Silva 82
Crystal Weah 72
Crystal Mane 62
Dortmund Silva 93
Dortmund Weah 83
Dortmund Mane 73
Everton Weah 94
Everton Silva 84
Everton Yorke 74
  • Atletico and Burney get their top rated players (Mane and Silva).
  • Crystal gets its second rated (Weah) because its first rated (Silva) is taken.
  • Dortmund has only rated three players and they are all taken, so Dortmund gets no one.
  • Everton gets its third rated player (Yorke) because the first two are taken.

For input, here is Frank’s table with the ratings for each team. Notice that (team_name, player_name) is unique, and (team_name, rating) is unique.

CREATE TABLE    choice (team_name, player_name, rating)    AS
SELECT  'Atletico', 'Mane',    90  FROM dual  UNION ALL
SELECT  'Atletico', 'Neymar',  80  FROM dual  UNION ALL      -- combination of
SELECT  'Atletico', 'Ronaldo', 60  FROM dual  UNION ALL      -- (team_name, player_name)
SELECT  'Burnley',  'Silva',   91  FROM dual  UNION ALL      -- is unique; so is
SELECT  'Burnley',  'Weah',    71  FROM dual  UNION ALL      -- (team_name, rating)
SELECT  'Burnley',  'Zidane',  51  FROM dual  UNION ALL
SELECT  'Crystal',  'Silva',   82  FROM dual  UNION ALL
SELECT  'Crystal',  'Weah',    72  FROM dual  UNION ALL
SELECT  'Crystal',  'Mane',    62  FROM dual  UNION ALL
SELECT  'Dortmund', 'Silva',   93  FROM dual  UNION ALL
SELECT  'Dortmund', 'Weah',    83  FROM dual  UNION ALL
SELECT  'Dortmund', 'Mane',    73  FROM dual  UNION ALL
SELECT  'Everton',  'Weah',    94  FROM dual  UNION ALL
SELECT  'Everton',  'Silva',   84  FROM dual  UNION ALL
SELECT  'Everton',  'Yorke',   74  FROM dual;

Starting to MODEL

select * from choice
model
dimension by ( team_name, player_name)
measures ( rating, 0 picked )
rules (
)
order by team_name, rating desc;

The MODEL clause turns the input data into a multi-dimensional array.

  • The DIMENSION clause lists the array indexes: here they are strings, not integers. The dimension columns identify each row uniquely, and their values cannot be changed.
  • The MEASURES clause lists columns whose values can be changed.
    A measure column in an individual row is called a cell: for example, RATING['Atletico', 'Mane']
  • The RULES clause lists rules that say what changes should be made. We’ll see rules in a bit.

Here we see what the data looks like before any rules are applied.

TEAM_NAME PLAYER_NAME RATING PICKED
Atletico Mane 90 0
Atletico Neymar 80 0
Atletico Ronaldo 60 0
Burnley Silva 91 0
Burnley Weah 71 0
Burnley Zidane 51 0
Crystal Silva 82 0
Crystal Weah 72 0
Crystal Mane 62 0
Dortmund Silva 93 0
Dortmund Weah 83 0
Dortmund Mane 73 0
Everton Weah 94 0
Everton Silva 84 0
Everton Yorke 74 0

 

What is a RULE?

A rule assigns a value to a cell. For example

PICKED['Atletico', 'Mane'] = 1

The left-hand side is the cell, and the right-hand side is the value assigned to that cell.

The left-hand side can refer to more than one cell:

  • PICKED['Atletico', ANY] = 1 will assign 1 to the PICKED cells for team ‘Atletico’, no matter who the player is;
  • PICKED[ANY, 'Mane'] = 1 will assign 1 to the PICKED cells for player ‘Mane’, no matter what the team is;
  • PICKED[ANY, ANY] = 1 will assign 1 to each and every PICKED cell.

The right-hand side contains a value, but that value can come from a cell. To use the same dimension values on the left-hand side and the right-hand side, we can use CV().

  • PICKED[ANY, ANY] = RATING[CV(),CV()] will assign to each PICKED cell the value RATING from the same row.

Each rule applies to only one cell at a time. If the left-hand side refers to several cells, the right-hand side is evaluated and assigned separately to each left-hand cell. The result of the evaluation must be one value, not a range of values.

Aggregations in Rules

On the right-hand side of a rule, we can aggregate over a range of cells to get one value.

  • PICKED[ANY, ANY] = max(RATING)[CV(),ANY] will assign to each PICKED cell the maximum value of RATING for that team. Notice the parentheses end before the dimension columns.

Ordering in Rules

Since each rule is applied to only one left-side cell at a time, the order of execution can be important. Look what happens with this rule:

picked[any,any] order by team_name, rating desc = max(picked)[any,any] + 1

TEAM_NAME PLAYER_NAME RATING PICKED
Atletico Mane 90 1
Atletico Neymar 80 2
Atletico Ronaldo 60 3
Burnley Silva 91 4
Burnley Weah 71 5
Burnley Zidane 51 6
Crystal Silva 82 7
Crystal Weah 72 8
Crystal Mane 62 9
Dortmund Silva 93 10
Dortmund Weah 83 11
Dortmund Mane 73 12
Everton Weah 94 13
Everton Silva 84 14
Everton Yorke 74 15

 

Do you see what happened here? For every left-side cell, the right-side expression is evaluated using the previously assigned values! That is why the order is so important.

The Solution to our problem

If we order the rows by team_name, rating desc then we can say for each row:

  • If the current row’s team has not picked
  • and the current row’s player has not been picked,
  • then pick that player for that team.
select * from choice
model
dimension by ( team_name, player_name)
measures ( rating, 0 picked )
rules (
  picked[any,any] order by team_name, rating desc = 
    case
      when max(picked)[any,cv()] = 0 and max(picked)[cv(),any] = 0
      then 1
      else 0
    end    
);
TEAM_NAME PLAYER_NAME RATING PICKED
Atletico Mane 90 1
Atletico Neymar 80 0
Atletico Ronaldo 60 0
Burnley Silva 91 1
Burnley Weah 71 0
Burnley Zidane 51 0
Crystal Silva 82 0
Crystal Weah 72 1
Crystal Mane 62 0
Dortmund Silva 93 0
Dortmund Weah 83 0
Dortmund Mane 73 0
Everton Weah 94 0
Everton Silva 84 0
Everton Yorke 74 1

 

Multiple Rounds

To let the teams pick more than one player, we can just add the ITERATE keyword. The ITERATION_NUMBER function starts with 0.

select * from choice
model
dimension by ( team_name, player_name)
measures ( rating, 0 picked )
rules iterate (3) (
  picked[any,any] order by team_name, rating desc = 
    case
      when max(picked)[any,cv()] = 0 and max(picked)[cv(),any] <= iteration_number
      then iteration_number + 1
      else picked[cv(),cv()]
    end    
);
TEAM_NAME PLAYER_NAME RATING PICKED
Atletico Mane 90 1
Atletico Neymar 80 2
Atletico Ronaldo 60 3
Burnley Silva 91 1
Burnley Weah 71 0
Burnley Zidane 51 2
Crystal Silva 82 0
Crystal Weah 72 1
Crystal Mane 62 0
Dortmund Silva 93 0
Dortmund Weah 83 0
Dortmund Mane 73 0
Everton Weah 94 0
Everton Silva 84 0
Everton Yorke 74 1

 

Did you make it this far?

If so, thanks for your patience! I thought this was a great example of the power of MODEL , because it uses ordering, aggregation and previously changed values in its evaluations – all in one pretty concise rule.

Hope this helps…

Missing cents when dividing an amount

Alex Nuijten (@alexnuijten) wrote a blog post on this subject a few years ago. Since his blog doesn’t accept comments at the moment, here is a variant I think is interesting.

The problem

As an example, say we have $100 that we want to divide into 42 parts.

  • ROUND(100/42,2) gives us $2.38
  • Multiplying $2.38 by 42 gives us $99.96
  • So to get a total of 100, we need to add one cent to four lines.

Adding cents to the first four lines

Here is Alex’s solution:

with amounts
as
( select level id
      ,100 amount
      ,count(*) over () entries
      ,round (100 / count(*) over (), 2) rounded
    from dual
    connect by level <= 42
)
,indicators as 
(
select id
      ,amount
      ,entries
      ,rounded
      ,case 
       when rownum <= abs ((amount - (entries * rounded)) * 100)
       then 'x'
       end as indicator
      ,sign (amount - (entries * rounded)) as pos_neg
 from amounts
)
select id
      ,rounded +
       (nvl2 (indicator, 0.01, 0) * pos_neg) final_amount
  from indicators
;
ID FINAL_AMOUNT
1 2.39
2 2.39
3 2.39
4 2.39
5 2.38
6 2.38
7 2.38
8 2.38
9 2.38

 

Adding cents using a running total

Personally, I would prefer to spread the extra cents more evenly across the lines. To do that, I simply round the current running total, then subtract the previous lines rounded running total. This turns out to be less code.

with data(id, amount, entries) as (
  select level, 100, count(*) over()
  from dual connect by level <= 42
)
select id, amount, entries,
  round(id*amount/entries, 2) - round((id-1)*amount/entries, 2) split
from data;

The lines with an extra cent are 6, 16, 27 and 37.

Hope this helps. Regards, Stew

My #OUGIreland19 Presentation on JSON

Thanks to those who came to my presentation today at OUG Ireland 2019. As promised, here is the link to the slides.

I had a great time at the conference, meeting new folks and spending time (but never enough) with the less new. The talks I attended were uniformly excellent. Special mention to Matt Mulvaney who again made me marvel at APEX, even though he showed how to go beyond what it can currently do.

Have safe trips home, everybody!

Best regards,
Stew

My #OUGN19 Presentations

As promised, here are the two presentations I gave on the OUGN 2019 Boat conference. I had a great time with old friends and new, and managed to have discussions about politics without anyone getting upset! Hope to see some of you again at OUG Ireland!

Please download the presentations to see the animations.

https://www.slideshare.net/stewashton/json-in-18c-and-19c

https://www.slideshare.net/stewashton/make-your-data-dance