Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gilles Darold <gilles(at)darold(dot)net>
Cc: Gilles Darold <gillesdarold(at)gmail(dot)com>, Chapman Flack <chap(at)anastigmatix(dot)net>, er(at)xs4all(dot)nl, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Date: 2021-08-01 17:23:13
Message-ID: 1456640.1627838593@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I've been working through this patch, and trying to verify
compatibility against Oracle and DB2, and I see some points that need
discussion or at least recording for the archives.

* In Oracle, while the documentation for regexp_instr says that
return_option should only be 0 or 1, experimentation with sqlfiddle
shows that any nonzero value is silently treated as 1. The patch
raises an error for other values, which I think is a good idea.
(IBM's docs say that DB2 raises an error too, though I can't test
that.) We don't need to be bug-compatible to that extent.

* What should happen when the subexpression/capture group number of
regexp_instr or regexp_substr exceeds the number of parenthesized
subexpressions of the regexp? Oracle silently returns a no-match
result (0 or NULL), as does this patch. However, IBM's docs say
that DB2 raises an error. I'm inclined to think that this is
likewise taking bug-compatibility too far, and that we should
raise an error like DB2. There are clearly cases where throwing
an error would help debug a faulty call, while I'm less clear on
a use-case where not throwing an error would be useful.

* IBM's docs say that both regexp_count and regexp_like have
arguments "string, pattern [, start] [, flags]" --- that is,
each of start and flags can be independently specified or omitted.
The patch follows Oracle, which has no start option for
regexp_like, and where you can't write flags for regexp_count
without writing start. This is fine by me, because doing these
like DB2 would introduce the same which-argument-is-this issues
as we're being forced to cope with for regexp_replace. I don't
think we need to accept ambiguity in these cases too. But it's
worth memorializing this decision in the thread.

* The patch has most of these functions silently ignoring the 'g'
flag, but I think they should raise errors instead. Oracle doesn't
accept a 'g' flag for these, so why should we? The only case where
that logic doesn't hold is regexp_replace, because depending on which
syntax you use the 'g' flag might or might not be meaningful. So
for regexp_replace, I'd vote for silently ignoring 'g' if the
occurrence-number parameter is given, while honoring it if not.

I've already made changes in my local copy per the last item,
but I've not done anything about throwing errors for out-of-range
subexpression numbers. Anybody have an opinion about that one?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2021-08-01 17:59:18 Re: slab allocator performance issues
Previous Message vignesh C 2021-08-01 15:32:16 Re: Corrected documentation of data type for the logical replication message formats.