Re: [PATCH] Support empty ranges with bounds information

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, "Mark Dilger" <mark(dot)dilger(at)enterprisedb(dot)com>, "Isaac Morland" <isaac(dot)morland(at)gmail(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>
Subject: Re: [PATCH] Support empty ranges with bounds information
Date: 2021-03-02 17:52:09
Message-ID: 9e4e979d-9232-4f17-849e-c9e7bc11b971@www.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 2, 2021, at 15:42, Tom Lane wrote:
> "Joel Jacobson" <joel(at)compiler(dot)org> writes:
> > As discussed in the separate thread "[PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]" [1]
> > it's currently not possible to create an empty range with bounds information.
>
> > This patch tries to improve the situation by keeping the bounds information,
> > and allow accessing it via lower() and upper().
>
> I think this is an actively bad idea. We had a clean set-theoretic
> definition of ranges as sets of points, and with this we would not.
> We should not be whacking around the fundamental semantics of a
> whole class of data types on the basis that it'd be cute to make
> regexp_position return its result as int4range rather than int[].

I think there are *lots* of other use-cases where the current semantics of range types are very problematic.

The regexp_positions() patch just demonstrates one concrete example
on when real-life zero-length ranges can definitively have positions,
regardless of what the mathematicians thinks.
(I use the term "position" here since if lower=upper bound,
then we're talking about a position, since it has zero length.)

I think there is a risk lots of users coming from other programming environments
will misunderstand how ranges work, start implementing something using them,
only to later have to rewrite all their code using ranges due to eventually encountering the
zero-length corner-case for which there is no work-around (except not using ranges).

Imagine e.g. a Rust user, who has learned how ranges work in Rust,
and thinks the program below is valid and and expects it to output
"start 3 end 3 is_empty true".

fn main() {
let r = std::ops::Range { start: 3, end: 3 };

println!(
"start {} end {} is_empty {}",
r.start,
r.end,
r.is_empty()
);
}

I think it would be a challenge to explain how PostgreSQL's range semantics
to this user, why you get NULL when trying to
access the start and end values for this range.

I feel this is a perfect example of when theory and practise has since long parted ways,
and the theory is only cute until you face the ugly reality.

That said, subtle changes are of course possibly dangerous,
and since I'm not a huge range type user myself,
I can't have an opinion on how many rely on the current null semantics for lower() and upper().

Argh! I wish we would have access to a large set of real-life real-time statistics on PostgreSQL SQL queries
and result sets from lots of different users around the world, similar to the regex test corpus.
It's very unfair e.g. Amazon with their Aurora could in theory collect such statistics on all their users,
so their Aurora-hackers could answers questions like

"Do lower() and upper() ever return NULL in real-life for ranges?"

While all we can do is to rely on user reports and our imagination.
Maybe we should allow opting-in to contribute with statistics from production servers,
to help us better understand how PostgreSQL is used in real-life?
I see lots of problems with data privacy, business secrets etc, but perhaps there are something that can be done.

Oh well. At least it was fun to learn about how ranges are implemented behind the scenes.

If we cannot do a subtle change, then I think we should consider an entirely new range class,
just like multi-ranges are added in v14. Maybe "negrange" could be a good name?

>
> If we did go forward with this, what would the implications be for
> multiranges?

None. It would only affect lower()/upper() for a single range created with bounds.

Before patch:

SELECT numrange(3,4) + numrange(5,5);
[3,4)
SELECT upper(numrange(3,4) + numrange(5,5));
4
SELECT numrange(5,5);
empty
SELECT upper(numrange(5,5));
NULL

After patch:

SELECT numrange(3,4) + numrange(5,5);
[3,4)
SELECT upper(numrange(3,4) + numrange(5,5));
4
SELECT numrange(5,5);
empty
SELECT upper(numrange(5,5));
5

At the very least, I think we should in any case add test coverage of what lower()/upper() returns for empty ranges.

/Joel

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Dilger 2021-03-02 18:01:44 Re: [PATCH] Support empty ranges with bounds information
Previous Message Jacob Champion 2021-03-02 17:48:05 Re: DETAIL for wrong scram password