Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "daniel(at)yesql(dot)se" <daniel(at)yesql(dot)se>, "Mark Dilger" <mark(dot)dilger(at)enterprisedb(dot)com>, "Postgres hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Andreas Karlsson" <andreas(at)proxel(dot)se>, "David Fetter" <david(at)fetter(dot)org>, "Gilles Darold" <gilles(at)darold(dot)net>, "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]
Date: 2021-12-25 11:02:29
Message-ID: d292520f-56d7-4f7e-a99c-d267f85129f2@www.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Sep 21, 2021, at 19:55, Tom Lane wrote:
> "Joel Jacobson" <joel(at)compiler(dot)org> writes:
> > [ 0005-regexp-positions.patch ]
>
> I took a quick look at this patch. I am on board now with the general
> idea of returning match positions, but I think there are still
> definitional issues to be discussed.
>
> 1. The main idea we might perhaps want to adopt from the Oracle-ish
> regexp functions is a search-start-position argument. I'm not sure
> that this is exciting --- if we intend this function to be a close
> analog of regexp_matches(), it'd be better to leave that off. But
> it deserves explicit consideration.

I intentionally made it a close analog of regexp_matches(),
to make it easy for existing users of regexp_matches() to
understand how regexp_positions() works.

> 2. It looks like you modeled this on regexp_matches() to the extent
> of returning a set and using the 'g' flag to control whether the
> set can actually contain more than one row. That's pretty ancient
> precedent ... but we have revisited that design more than once
> because regexp_matches() is just too much of a pain to use. I think
> if we're going to do this, we should learn from history, and provide an
> analog to regexp_match() as well as regexp_matches() right off the bat.

Yes, I modeled it on regexp_matches().
OK, so what you are suggesting is we should add both a regexp_position() function,
that would work like regexp_match() but return the position instead,
in addition to the already suggested regexp_positions() function.

That sounds like a good idea to me.

> 3. The API convention you chose (separate start and length arrays)
> is perhaps still confusing. When I first looked at the test case
>
> +SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);
> + regexp_positions
> +-------------------
> + ("{4,7}","{3,5}")
> +(1 row)
>
> I thought it was all wrong because it seemed to be identifying
> the substrings 'barbequ' and 'obarb'. If there'd been a different
> number of matches than capture groups, maybe I'd not have been
> confused, but still... I wonder if we'd be better advised to make
> N capture groups produce N two-element arrays, or else mash it all
> into one array of N*2 elements. But this probably depends on which
> way is the easiest to work with in SQL.

The drawbacks of two-element arrays have already been discussed up thread.
Personally, I prefer the version suggested in the latest patch, and suggest we stick to it.

> 4. Not sure about the handling of sub-matches.
> There are various plausible definitions we could use:
>
> * We return the position/length of the overall match, never mind
> about any parenthesized subexpressions. This is simple but I think
> it loses significant functionality. As an example, you might have
> a pattern like 'ab*(c*)d+' where what you actually want to know
> is where the 'c's are, but they have to be in the context described
> by the rest of the regexp. Without subexpression-match capability
> that's painful to do.
>
> * If there's a capturing subexpression, return the position/length
> of the first such subexpression, else return the overall match.
> This matches the behavior of substring().
>
> * If there are capturing subexpression(s), return the
> positions/lengths of those, otherwise return the overall match.
> This agrees with the behavior of regexp_match(es), so I'd tend
> to lean to this option, but perhaps it's the hardest to use.
>
> * Return the position/length of the overall match *and* those of
> each capturing subexpression. This is the most flexible choice,
> but I give it low marks since it matches no existing behaviors.

The existing behaviour of regexp_match(es) is perhaps a bit surprising to new users,
but I think one quickly learn the semantics by trying out a few examples,
and once understood it's at least not something that bothers me personally.

I think it's best to let regexp_position(s) work the same way as regexp_match(es),
since otherwise users would have to learn and remember two different behaviours.

> As for comments on the patch itself:
>
> * The documentation includes an extraneous entry for regexp_replace,
> and it fails to add the promised paragraph to functions-posix-regexp.
>
> * This bit is evidently copied from regexp_matches:
>
> + /* be sure to copy the input string into the multi-call ctx */
> + matchctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern,
>
> regexp_matches needs to save the input string so that
> build_regexp_match_result can copy parts of that. But
> regexp_positions has no such need AFAICS, so I think we
> don't need a long-lived copy of the string.
>
> * I wouldn't use these names in build_regexp_positions_result:
>
> + ArrayType *so_ary;
> + ArrayType *eo_ary;
>
> "so_ary" isn't awful, but it invites confusion with regex's "so"
> field, which hasn't got the same semantics (off by one).
> "eo_ary" is pretty bad because it isn't an "end offset" at all, but
> a length. I'd go for "start_ary" and "length_ary" or some such.
>
> * Test cases seem a bit thin, notably there's no coverage of the
> null-subexpression code path.

Thanks for the comments on the patch itself.
I will work on fixing these, but perhaps we can first see if it's possible to reach a consensus on the API convention and behaviour.

/Joel

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey Borodin 2021-12-25 13:35:30 Re: [PATCH] reduce page overlap of GiST indexes built using sorted method
Previous Message Amit Kapila 2021-12-25 04:20:44 Re: row filtering for logical replication