Re: range_agg

From: Paul A Jungwirth <pj(at)illuminatedcomputing(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: range_agg
Date: 2019-07-05 16:58:02
Message-ID: CA+renyX7jsWnyry6SOn-WSgdgQLtyFJ7SLWBVs+oatwQsMzvog@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jul 1, 2019 at 3:38 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> For getting into core though, it should be a more complete set of
> related operations. The patch is implicitly introducing the concept of
> a "multirange" (in this case, an array of ranges), but it's not making
> the concept whole.
>
> What else should return a multirange? This patch implements the union-
> agg of ranges, but we also might want an intersection-agg of ranges
> (that is, the set of ranges that are subranges of every input). Given
> that there are other options here, the name "range_agg" is too generic
> to mean union-agg specifically.

Thanks for the review!

I like the idea of adding a multirange type that works like range
types, although I'm not sure I want to build it. :-)

My own motivations for the range_agg patch are for temporal databases,
where I have two use-cases: checking temporal foreign keys [1] and
doing a Snodgrass "coalesce" operation [2]. The FKs use-case is more
important. For coalesce I just immediately UNNEST the array, so a
multirange would sort of "get in the way". It's no big deal if I can
cast a multirange to an array, although for something that would run
on every INSERT/UPDATE/DELETE I'd like to understand what the cast
would cost us in performance. But coalesce isn't part of SQL:2011 and
should be optional behavior or just something in an extension. The FKs
use case matters to me a lot more, and I think a multirange would be
fine for that. Also a multirange would solve the generics problem
Pavel asked about. (I'll say more about that in a reply to him.)

I'm not convinced that an intersection aggregate function for
multiranges would be used by anyone, but I don't mind building one.
Every other *_agg function has an "additive" sense, not a
"subtractive" sense. json{,b}_object_agg are the closest since you
*could* imagine intersection semantics for those, but they are unions.
So in terms of *naming* I think using "range_agg" for union semantics
is natural and would fit expectations. (I'm not the first to name this
function range_agg btw: [3]).

But there is clearly more than one worthwhile way to aggregate ranges:

- David suggested weighted_range_agg and covering_range_agg. At PGCon
2019 someone else said he has had to build something that was
essentially weighted_range_agg. I can see it used for
scheduling/capacity/max-utilization problems.
- You suggested an intersection range_agg.
- At [4] there is a missing_ranges function that only gives the *gaps*
between the input ranges.

Nonetheless I still think I would call the union behavior simply
range_agg, and then use weighted_range_agg, covering_range_agg,
intersection_range_agg, and missing_range_agg for the rest (assuming
we built them all). I'm not going to insist on any of that, but it's
what feels most user-friendly to me.

> What can we do with a multirange? A lot of range operators still make
> sense, like "contains" or "overlaps"; but "adjacent" doesn't quite
> work. What about new operations like inverting a multirange to get the
> gaps?

I can think of a lot of cool operators for `multirange \bigoplus
multirange` or `multirange \bigoplus range` (commuting of course). And
I've certainly wanted `range + range` and `range - range` in the past,
which would both return a multirange.

> If we have a more complete set of operators here, the flags for
> handling overlapping ranges and gaps will be unnecessary.

Both of my use cases should permit overlaps & gaps (range_agg(r, true,
true)), so I'm actually pretty okay with dropping the flags entirely
and just giving a one-param function that behavior. Or defining a
strict_range_agg that offers more control. But also I don't understand
how richer *operators* make the flags for the *aggregate* unnecessary.

So I really like the idea of multiranges, but I'm reluctant to take it
on myself, especially since this patch is just a utility function for
my other temporal patches. But I don't want my rush to leave a blemish
in our standard library either. But I think what really persuades me
to add multiranges is making a range_agg that more easily supports
user-defined range types. So how about I start on it and see how it
goes? I expect I can follow the existing code for range types pretty
closely, so maybe it won't be too hard.

Another option would be to rename my function range_array_agg (or
something) so that we are leaving space for a multirange function in
the future. I don't love this idea myself but it would could a Plan B.
What do you think of that?

Regards,
Paul

[1] With range_agg I can make the temporal FK check use similar SQL to
the non-temporal FK check:

SELECT 1
FROM [ONLY] <pktable> x
WHERE pkatt1 = $1 [AND ...]
FOR KEY SHARE OF x

vs

SELECT 1
FROM (
SELECT range_agg(r, true, true) AS r
FROM (
SELECT pkperiodatt AS r
FROM [ONLY] pktable x
WHERE pkatt1 = $1 [AND ...]
FOR KEY SHARE OF x
) x1
) x2
WHERE $n <@ x2.r[1]

(The temporal version could be simplified a bit if FOR KEY SHARE ever
supports aggregate functions.)

[2] page number 159ff in https://www2.cs.arizona.edu/~rts/tdbbook.pdf
(section 6.5.1, starts on page 183 of the PDF)

[3] https://git.proteus-tech.com/open-source/django-postgres/blob/fa91cf9b43ce942e84b1a9be22f445f3515ca360/postgres/sql/range_agg.sql

[4] https://schinckel.net/2014/11/18/aggregating-ranges-in-postgres/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2019-07-05 16:58:30 Re: Fix runtime errors from -fsanitize=undefined
Previous Message Jeff Davis 2019-07-05 16:48:29 Re: range_agg