Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Joel Jacobson" <joel(at)compiler(dot)org>
Cc: "Mark Dilger" <mark(dot)dilger(at)enterprisedb(dot)com>, "Postgres hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Andreas Karlsson" <andreas(at)proxel(dot)se>, "David Fetter" <david(at)fetter(dot)org>
Subject: Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]
Date: 2021-03-08 17:30:52
Message-ID: 1744767.1615224652@sss.pgh.pa.us
Lists: pgsql-hackers

"Joel Jacobson" <joel(at)compiler(dot)org> writes:
> I prefer to think of a match as two points. If the points are at the same position, it's a zero length match.

FWIW, I personally think that returning a start position and a length
would be the most understandable way to operate. If you report start
position and end position then there is always going to be confusion
over whether the end position is inclusive or exclusive (that is,
some code including our regex library thinks of the "end" as being
"first character after the match"). This is indeed the same
definitional issue you're contending with vis-a-vis range endpoints,
only now you lack any pre-existing definition that people might rely on
to know what you meant.
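
As an illustration of the ambiguity described above (not part of the proposed patch), here is a minimal Python sketch that reports one match in three ways. It assumes Python's `re` module, whose `span()` uses 0-based offsets with an exclusive end; PostgreSQL string positions are 1-based, so the numbers would differ there. The helper name `match_positions` and the dictionary keys are hypothetical.

```python
import re

def match_positions(string, pattern):
    """Report the first match as start+length and as start+end (both conventions)."""
    m = re.search(pattern, string)
    if m is None:
        return None
    start, end_exclusive = m.span()  # end is "first character after the match"
    return {
        "start_length": (start, end_exclusive - start),
        "start_end_exclusive": (start, end_exclusive),
        "start_end_inclusive": (start, end_exclusive - 1),
    }

# A three-character match starting at offset 3.
normal = match_positions("foobarbequebaz", "bar")

# A zero-length match at offset 0: the inclusive end falls before the start.
empty = match_positions("abc", "x*")
```

With start and length, the zero-length case is simply a length of 0; with an inclusive end, the same match has an end position smaller than its start, which is the kind of confusion the paragraph above warns about.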

> Since there are currently zero composite type returning catalog functions, I can see why the idea of returning a "range" with two "start" and "stop" fields is controversial. There are probably good reasons that I fail to see why there are no composite type returning functions in the catalogs. Ideas on why this is the case, anyone?

Yeah: it's hard. The amount of catalog infrastructure needed by a
composite type is dauntingly large, and genbki.pl doesn't offer any
support for building composite types that aren't tied to catalogs.
(I suppose if you don't mind hacking Perl, you could try to refactor
it to improve that.) Up to now we've avoided the need for that,
since a function can be declared to return an anonymous record type
by giving it some OUT parameters. However, if I'm understanding
things correctly, "regexp_positions(IN ..., OUT match_start integer,
OUT match_length integer) RETURNS SETOF record" wouldn't be enough
for you, because you really need a 2-D tableau of match data to
handle the case of multiple capturing parens plus 'g' mode. It
seems like you need it to return setof array(s), so the choices are
array of composite, 2-D array, or two parallel arrays. I'm not sure
the first of those is so much better than the others that it's worth
the pain involved to set up the initial catalog data that way.
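
To make the "2-D tableau" concrete, the following Python sketch (an illustration only, not the patch's implementation) builds one row per match and one entry per capturing group, then shows the three candidate shapes. It assumes Python's `re` module with 0-based offsets; the function names `positions_tableau` and `as_parallel_arrays` are hypothetical.

```python
import re

def positions_tableau(string, pattern):
    """One row per match ('g' mode), one (start, length) pair per capture group."""
    rows = []
    for m in re.finditer(pattern, string):
        row = []
        for g in range(1, m.re.groups + 1):
            start, end = m.span(g)  # (-1, -1) if the group did not participate
            row.append(None if start < 0 else (start, end - start))
        rows.append(row)
    return rows

def as_parallel_arrays(row):
    """Split one row into two parallel arrays: starts and lengths."""
    starts = [None if p is None else p[0] for p in row]
    lengths = [None if p is None else p[1] for p in row]
    return starts, lengths

tableau = positions_tableau("foobarbequebazilbarfbonk", r"(b[^b]+)(b[^b]+)")

# Shape 1, array of composite: each row is a list of (start, length) pairs.
composite_rows = tableau

# Shape 2, 2-D array: each row is a list of [start, length] lists.
two_d_rows = [[None if p is None else list(p) for p in row] for row in tableau]

# Shape 3, two parallel arrays per row.
parallel_rows = [as_parallel_arrays(row) for row in tableau]
```

Each element of the outer list corresponds to one row of a "setof" result. A single pair of OUT integers per row cannot carry this, because every row needs one position per capturing group.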

BTW, I don't know if you know the history here, but regexp_matches()
is way older than regexp_match(); we eventually invented the latter
because the former was just too hard to use for easy non-'g' cases.
I'm inclined to think we should learn from that and provide equivalent
variants regexp_position[s] right off the bat.

regards, tom lane
