Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Mark Dilger" <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: "Postgres hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Andreas Karlsson" <andreas(at)proxel(dot)se>, "David Fetter" <david(at)fetter(dot)org>
Subject: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]
Date: 2021-03-05 19:46:24
Message-ID: 57c492b8-e8f0-44a6-926e-14a2f2f766c6@www.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 2, 2021, at 01:12, Mark Dilger wrote:
> I like the idea so I did a bit of testing. I think the following should not error, but does:
>
> +SELECT regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
> +ERROR: range lower bound must be less than or equal to range upper bound

Doh! How stupid of me. I realize now I had a off-by-one thinko in my 0001 patch using int4range.

I didn't use the raw "so" and "eo" values in regexp.c like I should have,
instead, I incorrectly used (so + 1) as the startpos,
and just eo as the endpos.

This is what caused all the problems.

The fix is simple:
- lower.val = Int32GetDatum(so + 1);
+ lower.val = Int32GetDatum(so);

The example that gave the error now works properly:

SELECT regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
regexp_positions
------------------
{"[6,7)"}
(1 row)

I've also created a SQL PoC of the composite range type idea,
and convenience wrapper functions for int4range and int8range.

CREATE TYPE range AS (start int8, stop int8);

Helper functions:
range(start int8, stop int8) -> range
range(int8range) -> range
range(int4range) -> range
range(int8range[]) -> range[]
range(int4range[]) -> range[]

Demo:

regexp_positions() returns setof int4range[]:

SELECT r FROM regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g') AS r;
r
-----------------------
{"[3,7)","[6,12)"}
{"[11,17)","[16,21)"}
(2 rows)

Convert int4range[] -> range[]:

SELECT range(r) FROM regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g') AS r;
range
-----------------------
{"(3,6)","(6,11)"}
{"(11,16)","(16,20)"}
(2 rows)

"start" and "stop" fields:

SELECT (range(r[1])).* FROM regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g') AS r;
start | stop
-------+------
3 | 6
11 | 16
(2 rows)

zero-length match at beginning:

SELECT r FROM regexp_positions('','^','g') AS r;
r
-----------
{"[0,1)"}
(1 row)

SELECT (range(r[1])).* FROM regexp_positions('','^','g') AS r;
start | stop
-------+------
0 | 0
(1 row)

My conclusion is that we should use setof int4range[] as the return value for regexp_positions().

New patch attached.

The composite range type and helper functions are of course not at all necessary,
but I think they would be a nice addition, to make it easier to work with ranges
for composite types. I intentionally didn't create anyrange versions of them,
since they can only support composite types,
since they don't require the inclusive/exclusive semantics.

/Joel

Attachment Content-Type Size
range.sql application/octet-stream 1.5 KB
0003-regexp-positions.patch application/octet-stream 8.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message James Coleman 2021-03-05 20:04:50 Re: Nicer error when connecting to standby with hot_standby=off
Previous Message Jacob Champion 2021-03-05 19:11:12 Re: PROXY protocol support