Re: range_agg

From: Paul A Jungwirth <pj(at)illuminatedcomputing(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: range_agg
Date: 2020-03-11 23:04:40
Message-ID: CA+renyWdkaThzX3FKCVU6eV7k6=NJH862w2TjkQ2M7jftFoWWg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks everyone for offering some thoughts on this!

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> have you given any thought to just deciding that ranges and
> multiranges are the same type?

I can see how it might be nice to have just one type to think about.
Still I think keeping them separate makes sense. Other folks have
brought up several reasons already. Just to chime in:

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Isaac Morland <isaac(dot)morland(at)gmail(dot)com> writes:
> > Definitely agreed that range and multirange (or whatever it's called)
> > should be different. In the work I do I have a number of uses for ranges,
> > but not (yet) for multiranges. I want to be able to declare a column as
> > range and be sure that it is just a single range, and then call lower() and
> > upper() on it and be sure to get just one value in each case; and if I
> > accidentally try to take the union of ranges where the union isn’t another
> > range, I want to get an error rather than calculate some weird (in my
> > context) multirange.
>
> I do not find that argument convincing at all. Surely you could put
> that constraint on your column using "CHECK (numranges(VALUE) <= 1)"
> or some such notation.

A check constraint works for columns, but there are other contexts
where you'd like to restrict things to just a contiguous range, e.g.
user-defined functions and intermediate results in queries. Basic
ranges seem a lot simpler to think about, so I can appreciate how
letting any range be a multirange adds a heavy cognitive burden. I
think a lot of people will share Isaac's opinion here.

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Also, this would allow us to remove at least one ugly misfeature:
>
> regression=# select '[1,2]'::int4range + '[3,10)'::int4range;
> ?column?
> ----------
> [1,10)
> (1 row)
>
> regression=# select '[1,2]'::int4range + '[4,10)'::int4range;
> ERROR: result of range union would not be contiguous

Because of backwards compatibility we can't really change +/-/* not to
raise (right?), so if we joined ranges and multiranges we'd need to
add operators with a different name. I was calling those @+/@-/@*
before, but that was considered too unintuitive and undiscoverable.
Having two types lets us use the nicer operator names.

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> it seems like we could consider the traditional
> range functions like lower() and upper() to report on the first or last
> range bound in a multirange

I tried to keep functions/operators similar, so already lower(mr) =
lower(r) and upper(mr) = upper(r).
I think *conceptually* it's good to make ranges & multiranges as
interchangable as possible, but that doesn't mean they have to be the
same type.

Adding multiranges-as-ranges also raises questions about their string
format. If a multirange is {[1,2), [4,5)} would you only print the
curly braces when there is more than one element?

I don't *think* allowing non-contiguous ranges would break how we use
them in GiST indexes or exclusion constraints, but maybe someone can
think of some problem I can't. It's one place to be wary anyway. At
the very least it would make those things slower I expect.

On a few other issues people have raised recently:

Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> writes:
> I wonder what's the point of multirange arrays. Is there a reason we
> create those?

We have arrays of everything else, so why not have them for
multiranges? We don't have to identify specific use cases here,
although I can see how you'd want to call array_agg/UNNEST on some
multiranges, e.g. (Actually I really want to add an UNNEST that
*takes* a multirange, but that could be a follow-on commit.) If
nothing else I think omitting arrays of multiranges would be a strange
irregularity in the type system.

David G. Johnston <david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
> In the tests there is:
>
> +select '{[a,a],[b,b]}'::textmultirange;
> + textmultirange
> +----------------
> + {[a,a],[b,b]}
> +(1 row)
> +
> +-- without canonicalization, we can't join these:
> +select '{[a,a], [b,b]}'::textmultirange;
> + textmultirange
> +----------------
> + {[a,a],[b,b]}
> +(1 row)
> +
>
> Aside from the comment they are identical so I'm confused as to why both tests exist - though I suspect it has to do with the fact that the expected result would be {[a,b]} since text is discrete.

Those tests are for basic string parsing (multirange_in), so one is
testing {A,B} and the other {A, B} (with a space after the comma).
(There are some tests right above those that also have blank spaces,
but they only output a single element in the multirange result.)

David G. Johnston <david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
> Also, the current patch set seems a bit undecided on whether it wants to be truly a multi-range or a range that can report non-contiguous components. Specifically,
>
> +select '{[a,d), [b,f]}'::textmultirange;
> + textmultirange
> +----------------
> + {[a,f]}
> +(1 row)

Without a canonicalization function, we can't know that [a,a] touches
[b,b], but we *can* know that [a,d) touches [b,f). Or even:

regression=# select '{[a,b), [b,b]}'::textmultirange;
textmultirange
----------------
{[a,b]}
(1 row)

So I'm joining ranges whenever we know they touch. I think this is
consistent with existing range operators, e.g.:

regression=# select '[a,a]'::textrange -|- '[b,b]';
?column?
----------
f

regression=# select '[a,b)'::textrange -|- '[b,b]';
?column?
----------
t

David G. Johnston <david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
> There is a an argument that a multi-range should output {[a,d),[b,f]}. IMO its arguable that a multi-range container should not try and reduce the number of contained ranges at all.

Automatically combining touching ranges seems very desirable to me,
and one of the motivations to building a multirange type instead of
just using an array of ranges. Mathematically {[1,2), [2,3)} is
equivalent to {[1,3)}, and merging the touching elements keeps things
easier to read/understand and faster. Ranges themselves have a
canonical form too. Not canonicalizing raises a lot of questions, like
when are two "equivalent" ranges equal? And when you compose
operators/functions, do you keep all the internally-introduced splits,
or somehow preserve only splits that were present in your top-level
inputs? If you really want a list of possibly-touching ranges, I would
use an array for that. Why even have a range_agg (instead of just
array_agg'ing the ranges) if you're not going to merge the inputs?

Isaac Morland <isaac(dot)morland(at)gmail(dot)com> writes:
> On a related note, I was thinking about this and I don’t think I like
> range_agg as a name at all. I know we have array_agg and string_agg but surely
> shouldn’t this be called union_agg, and shouldn’t there also be an
> intersect_agg? I mean, taking the union isn’t the only possible aggregate on
> ranges or multiranges.

The patch does include a range_intersect_agg already. Since there are
so many set-like things in SQL, I don't think the unqualified
union_agg/intersect_agg are appropriate to give to ranges alone. And
the existing ${type}_agg functions are all "additive": json_agg,
json_object_agg, array_agg, string_agg, xmlagg. So I think range_agg
is the least-surprising name for this behavior. I'm not even the first
person to call it that, as you can see from [1].

David Fetter <david(at)fetter(dot)org> writes:
> One way to do that would be to include a "range cardinality" in the
> data structure which be the number of left ends in it.

I agree that is probably useful enough to add to this patch. I'll work on it.

David Fetter <david(at)fetter(dot)org> writes:
> There's another use case not yet covered here that could make this
> even more complex, we should probably plan for it: multi-ranges with
> weights.

Several people have asked me about this. I think it would need to be a
separate type though, e.g. weighted_multirange. Personally I wouldn't
mind working on it eventually, but I don't think it needs to be part
of this initial patch. Possibly it could even be an extension. In lieu
of a real type you also have an array-of-ranges, which is what I
originally proposed range_agg to return.

Finally, I think I mentioned this a long time ago, but I'm still not
sure if this patch needs work around these things:

- gist opclass
- spgist opclass
- typanalyze
- selectivity

I'd love for a real Postgres expert to tell me "No, we can add that
later" or "Yes, you have to add that now." Even better if they can
offer some help, because I'm not sure I understand those areas well
enough to do it myself.

Thanks all,
Paul

[1] https://git.proteus-tech.com/open-source/django-postgres/blob/fa91cf9b43ce942e84b1a9be22f445f3515ca360/postgres/sql/range_agg.sql

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nikita Glukhov 2020-03-11 23:09:44 Re: SQL/JSON: functions
Previous Message Tom Lane 2020-03-11 22:33:42 Re: [PATCH] Use PKG_CHECK_MODULES to detect the libxml2 library