Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Mark Dilger" <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: "Postgres hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Andreas Karlsson" <andreas(at)proxel(dot)se>, "David Fetter" <david(at)fetter(dot)org>
Subject: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]
Date: 2021-03-08 17:05:15
Message-ID: d8f6e621-4e46-4af7-a011-7c575fc28ac4@www.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 8, 2021, at 17:20, Mark Dilger wrote:
> > On Mar 5, 2021, at 11:46 AM, Joel Jacobson <joel(at)compiler(dot)org> wrote:
> > <range.sql><0003-regexp-positions.patch>
>
> I did a bit more testing:
>
> +SELECT regexp_positions('foobarbequebaz', 'b', 'g');
> + regexp_positions
> +------------------
> + {"[3,5)"}
> + {"[6,8)"}
> + {"[11,13)"}
> +(3 rows)
> +
>
> I understand that these ranges are intended to be read as one character long matches starting at positions 3, 6, and 11, but they look like they match either two or three characters, depending on how you read them, and users will likely be confused by that.
>
> +SELECT regexp_positions('foobarbequebaz', '(?=beque)', 'g');
> + regexp_positions
> +------------------
> + {"[6,7)"}
> +(1 row)
> +
>
> This is a zero length match. As above, it might be confusing that a zero length match reads this way.
>
> +SELECT regexp_positions('foobarbequebaz', '(?<=z)', 'g');
> + regexp_positions
> +------------------
> + {"[14,15)"}
> +(1 row)
> +
>
> Same here, except this time position 15 is referenced, which is beyond the end of the string.
>
> I think a zero length match at the end of this string should read as {"[14,14)"}, and you have been forced to add one to avoid that collapsing down to "empty", but I'd rather you found a different datatype rather than abuse the definition of int4range.
>
> It seems that you may have reached a similar conclusion down-thread?

This is due to the, in my opinion, unfortunate decision of using inclusive/exclusive as the canonical form for discrete types.
Probably not much we can do about that, but that's what we have, so I think it's fine.

[6,7) is exactly the same thing as [6,6] for discrete types, it simply means the startpos and endpos both are 6.

I prefer to think of a match as two points. If the points are at the same position, it's a zero length match.

In the example, the startpos and endpos are both at 6, so it's a zero length match.,

This was very confusing to me at first. I wrongly thought I needed an empty int4range and had the perverse idea of hacking the range type to allow setting lower and upper even though it was empty. This was a really really bad idea which I feel stupid of even considering. It was before I understood a zero length match should actually *not* be represented as an empty int4range, but as an int4range covering exactly one single integer, since startpos=endpos. This was due to my off-by-one error. With that fixed, the only problem is the (in my opinion) unnatural canonical form for discrete types, since in this context it's just silly to talk about inclusive/exclusive. I think inclusive/inclusive would have been much more SQL idiomatic, since that's the semantics for BETWEEN in SQL, it's inclusive/inclusive. So is most other programming environments I've seen.

However, not much we can do about that for int4range/int8range,
but maybe multirange could change the canonical form going forward.

Even if not changed, I think int4range works just fine. It just requires a bit more mental effort to understand what the values mean. Probably an annoyance for users at first, but I think they easily will understand they should just do "-1" on the upper() value (but only if upper_inc() is FALSE, but you know that for sure for int4ranges, so it is really necessary, one might wonder).

If a N+1 dimension array could easily be unnested to a N dimension array,
I would prefer Tom's idea of a 2-D regexp_positions(), since it simple and not controversial.

Since there are currently zero composite type returning catalog functions, I can see why the idea of returning a "range" with two "start" and "stop" fields is controversial. There are probably good reasons that I fail to see why there are no composite type returning functions in the catalogs. Ideas on why this is the case, anyone?

/Joel

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey Borodin 2021-03-08 17:06:32 Re: Yet another fast GiST build
Previous Message Magnus Hagander 2021-03-08 17:03:59 Re: pg_stat_statements oddity with track = all