Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Mark Dilger" <mark(dot)dilger(at)enterprisedb(dot)com>, "Postgres hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Andreas Karlsson" <andreas(at)proxel(dot)se>, "David Fetter" <david(at)fetter(dot)org>, "Gilles Darold" <gilles(at)darold(dot)net>, "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]
Date: 2021-03-09 19:30:21
Message-ID: bf2222d5-909d-408b-8531-95b32f18d4ab@www.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 9, 2021, at 17:42, Tom Lane wrote:
> "Joel Jacobson" <joel(at)compiler(dot)org> writes:
> > Tom - can you please give details on your unpleasant experiences with parallel arrays?
>
> The problems I can recall running into were basically down to not having
> an easy way to iterate through parallel arrays. There are ways to do
> that in SQL, certainly, but they all constrain how you write the query,
> and usually force ugly stuff like splitting it into sub-selects.

I see now what you mean, many thanks for explaining.

>
> As an example, presuming that regexp_positions is defined along the
> lines of
>
> regexp_positions(str text, pat text, out starts int[], out lengths int[])
> returns setof record

+1

I think this is the most feasible best option so far.

Attached is a patch implementing it this way.

I changed the start to begin at 1, since this is how position ( substring text IN string text ) → integer works.

SELECT * FROM regexp_positions('foobarbequebaz', '^', 'g');
starts | lengths
--------+---------
{1} | {0}
(1 row)

SELECT * FROM regexp_positions('foobarbequebaz', 'ba.', 'g');
starts | lengths
--------+---------
{4} | {3}
{12} | {3}
(2 rows)

Mark's examples:

SELECT * FROM regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
starts | lengths
--------+---------
{7} | {0}
(1 row)

SELECT * FROM regexp_positions('foobarbequebaz', '(?<=z)', 'g');
starts | lengths
--------+---------
{15} | {0}
(1 row)

I've also tested your template queries:

>
> then to actually get the identified substrings you'd have to do something
> like
>
> select
> substring([input string] from starts[i] for lengths[i])
> from
> regexp_positions([input string], [pattern]) r,
> lateral
> generate_series(1, array_length(starts, 1)) i;

select
substring('foobarbequebaz' from starts[i] for lengths[i])
from
regexp_positions('foobarbequebaz', 'ba.', 'g') r,
lateral
generate_series(1, array_length(starts, 1)) i;

substring
-----------
bar
baz
(2 rows)

> I think the last time I confronted this, we didn't have multi-array
> UNNEST. Now that we do, we can get rid of the generate_series(),
> but it's still not beautiful:
>
> select
> substring([input string] from s for l)
> from
> regexp_positions([input string], [pattern]) r,
> lateral
> unnest(starts, lengths) u(s,l);

select
substring('foobarbequebaz' from s for l)
from
regexp_positions('foobarbequebaz', 'ba.', 'g') r,
lateral
unnest(starts, lengths) u(s,l);

substring
-----------
bar
baz
(2 rows)

> Having said that, the other alternative with a 2-D array:
>
> regexp_positions(str text, pat text) returns setof int[]
>
> seems to still need UNNEST, though now it's not the magic multi-array
> UNNEST but this slicing version:
>
> select
> substring([input string] from u[1] for u[2])
> from
> regexp_positions([input string], [pattern]) r,
> lateral
> unnest_slice(r, 1) u;

Unable to test this one since there is no unnest_slice() (yet)

>
> Anyway, I'd counsel trying to write out SQL implementations
> of regexp_matches() and other useful things based on any
> particular regexp_positions() API you might be thinking about.
> Can we do anything useful without a LATERAL UNNEST thingie?
> Are some of them more legible than others?

Hmm, I cannot think of a way.

/Joel

Attachment Content-Type Size
0004-regexp-positions.patch application/octet-stream 9.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joe Conway 2021-03-09 19:33:47 Re: Procedures versus the "fastpath" API
Previous Message Matthias van de Meent 2021-03-09 19:28:19 Re: Lowering the ever-growing heap->pd_lower