Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gilles Darold <gilles(at)darold(dot)net>
Cc: Chapman Flack <chap(at)anastigmatix(dot)net>, er(at)xs4all(dot)nl, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Date: 2021-08-01 23:21:55
Message-ID: 1567465.1627860115@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Gilles Darold <gilles(at)darold(dot)net> writes:
> [ v5-0001-regexp-foo-functions.patch ]

I've gone through this whole patch now, and found quite a lot that I did
not like. In no particular order:

* Wrapping parentheses around the user's regexp doesn't work. It can
turn an invalid regexp into a valid one: for example 'a)(b' should draw
a syntax error. With this patch, no error would be thrown, but the
"outer" parens wouldn't do what you expected. Worse, it can turn a
valid regexp into an invalid one: the metasyntax options described in
9.7.3.4 only work at the start of the regexp. So we have to handle
whole-regexp cases honestly rather than trying to turn them into an
instance of the parenthesized-subexpression case.

* You did a lot of things quite inefficiently, apparently to avoid
touching any existing code. I think it's better to extend
setup_regexp_matches() and replace_text_regexp() a little bit so that
they can support the behaviors these new functions need. In both of
them, it's absolutely trivial to allow a search start position to be
passed in; and it doesn't take much to teach replace_text_regexp()
to replace only the N'th match.

* Speaking of N'th, there is not much of anything that I like
about Oracle's terminology for the function arguments, and I don't
think we ought to adopt it. If we're documenting the functions as
processing the "N'th match", it seems to me to be natural to call
the parameter "N" not "occurrence". Speaking of the "occurrence'th
occurrence" is just silly, not to mention long and easy to misspell.
Likewise, "position" is a horribly vague term for the search start
position; it could be interpreted to mean several other things.
"start" seems much better. "return_opt" is likewise awfully unclear.
I went with "endoption" below, though I could be talked into something
else. The only one of Oracle's choices that I like is "subexpr" for
subexpression number ... but you went with DB2's rather vague "group"
instead. I don't want to use their "capture group" terminology,
because that appears nowhere else in our documentation. Our existing
terminology is "parenthesized subexpression", which seems fine to me
(and also agrees with Oracle's docs).

* I spent a lot of time on the docs too. A lot of the syntax specs
were wrong (where you put the brackets matters), many of the examples
seemed confusingly overcomplicated, and the text explanations needed
copy-editing.

* Also, the regression tests seemed misguided. This patch is not
responsible for testing the regexp engine as such; we have tests
elsewhere that do that. So I don't think we need complex regexps
here. We just need to verify that the parameters of these functions
act properly, and check their error cases. That can be done much
more quickly and straightforwardly than what you had.

So here's a revised version that I like better. I think this
is pretty nearly committable, aside from the question of whether
a too-large subexpression number should be an error or not.

regards, tom lane

Attachment Content-Type Size
v6-0001-regexp-foo-functions.patch text/x-diff 60.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2021-08-02 00:01:33 Re: A qsort template
Previous Message Peter Smith 2021-08-01 22:49:24 Re: Corrected documentation of data type for the logical replication message formats.