Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

From: Chapman Flack <chap(at)anastigmatix(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Joel Jacobson <joel(at)compiler(dot)org>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Fetter <david(at)fetter(dot)org>, Gilles Darold <gilles(at)darold(dot)net>
Subject: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]
Date: 2021-03-04 16:41:10
Message-ID: 60410DA6.3050501@anastigmatix.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 03/04/21 10:40, Tom Lane wrote:
> Also, did you see
>
> https://www.postgresql.org/message-id/fc160ee0-c843-b024-29bb-97b5da61971f%40darold.net
>
> Seems like there may be some overlap in these proposals.

Not only that, the functions in that other proposal are very similar
to the standard's own functions that are specified to use XML Query
regular expression syntax (sample implementations in [1]).

These differently-named (which is good) functions seem to be a de facto
standard where the regexp syntax and semantics are those native to the
DBMS, the correspondence being

de facto ISO XQuery-based
-------------- ------------------
regexp_like like_regex
regexp_count occurrences_regex
regexp_instr position_regex
regexp_substr substring_regex
regexp_replace translate_regex

The regexp_positions proposal highlights an interesting apparent gap in
both the de facto and the ISO specs: the provided functions allow you
to specify which occurrence you're talking about, and get the corresponding
positions or the corresponding substring, but neither set of functions
includes one to just give you all the matching positions at once as
a SETOF something.

What the proposed regexp_positions() returns is pretty much exactly
the notional "list of match vectors" that appears internally throughout
the specs of the ISO functions, but is never directly exposed.

In the LOMV as described in the standard, the position/length arrays
are indexed from zero, and the start and length at index 0 are those
for the overall match as a whole.

Right now, if you have a query that involves, say,

substring_regex('(b[^b]+)(b[^b]+)' IN str GROUP 1) and also
substring_regex('(b[^b]+)(b[^b]+)' IN str GROUP 2),

a naïve implementation like [1] will of course compile and evaluate
the regexp twice and return one group each time. It makes me wonder
whether the standards committee was picturing a clever parse analyzer
and planner that would say "aha! you want group 1 and group 2 from
a single evaluation of this regex!", and that might even explain the
curious rule in the standard that the regex must be an actual literal,
not any other expression. (Still, that strikes me as an awkward way to
have to write it, spelling the regex out as a literal, twice.)

It has also made my idly wonder how close we could get to behaving
that way, perhaps with planner support functions and other available
parse analysis/planning hooks. Would any of those mechanisms get a
sufficiently global view of the query to do that kind of rewriting?

Regards,
-Chap

[1]
https://tada.github.io/pljava/pljava-examples/apidocs/org/postgresql/pljava/example/saxon/S9.html#method.summary

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Laurenz Albe 2021-03-04 16:41:21 Re: Confusing behavior of psql's \e
Previous Message John Naylor 2021-03-04 16:30:45 Re: WIP: BRIN multi-range indexes