[PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: "Andreas Karlsson" <andreas(at)proxel(dot)se>, "David Fetter" <david(at)fetter(dot)org>
Subject: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]
Date: 2021-03-01 17:38:18
Message-ID: 9b2fe34e-b666-471d-8297-f0c82b133085@www.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I suggest adding a new function, regexp_positions(),
which works exactly like regexp_matches(),
except it returns int4range[] start/end positions for *where* the matches occurs.

I first thought I could live without this function,
and just get the positions using strpos(),
but as Andreas Karlsson kindly helped me understand,
that naive idea doesn't always work.

Andreas provided this pedagogic example
to demonstrate the problem:

SELECT regexp_matches('caaabaaa', '(?<=b)(a+)', 'g');
regexp_matches
----------------
{aaa}
(1 row)

If we would try to use strpos() to find the position,
based on the returned matched substring,
we would get the wrong answer:

SELECT strpos('caaabaaa','aaa');
strpos
--------
2
(1 row)

Sure, there is "aaa" at position 2,
but that's not where the match occurred,
since the (?<=b) means "positive lookbehind of the character b",
so the match actually occurred at position 6,
where there is a "b" before the "aaa".

Using regexp_positions(), we can now get the correct answer:

SELECT regexp_positions('caaabaaa', '(?<=b)(a+)', 'g');
regexp_positions
------------------
{"[6,9)"}
(1 row)

Some more examples from the regress/sql/strings.sql,
showing both regexp_matches() and regexp_positions()
for the same examples, as they both return the same structure,
but with different types:

SELECT regexp_matches('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
regexp_matches
----------------
{bar,beque}
{bazil,barf}
(2 rows)

SELECT regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
regexp_positions
-----------------------
{"[4,7)","[7,12)"}
{"[12,17)","[17,21)"}
(2 rows)

I've added documentation and tests.

Forgive me for just sending a patch without much discussion on the list,
but it was so easy to implement, so I thought an implementation can
help the discussion on if this is something we want or not.

A few words on the implementation:
I copied build_regexp_match_result() to a new function build_regexp_positions_result(),
and removed the string parts, replacing it with code to make ranges instead.
Maybe there are common parts that could be put into some helper-function,
but I think not, since the functions are two small for it to be worth it.

Thanks to David Fetter for the idea on using ranges.

Based on HEAD (f5a5773a9dc4185414fe538525e20d8512c2ba35).

/Joel

Attachment Content-Type Size
0001-regexp-positions.patch application/octet-stream 8.6 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Dilger 2021-03-01 17:47:57 Re: Add --tablespace option to reindexdb
Previous Message Mark Dilger 2021-03-01 17:26:03 Re: Add --tablespace option to reindexdb