Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From: Gilles Darold <gilles(at)darold(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Chapman Flack <chap(at)anastigmatix(dot)net>, er(at)xs4all(dot)nl, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Date: 2021-08-01 19:48:00
Message-ID: 366af4de-b78e-ef76-5c7a-3d8a16567168@darold.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Le 01/08/2021 à 19:23, Tom Lane a écrit :
> I've been working through this patch, and trying to verify
> compatibility against Oracle and DB2, and I see some points that need
> discussion or at least recording for the archives.
>
> * In Oracle, while the documentation for regexp_instr says that
> return_option should only be 0 or 1, experimentation with sqlfiddle
> shows that any nonzero value is silently treated as 1. The patch
> raises an error for other values, which I think is a good idea.
> (IBM's docs say that DB2 raises an error too, though I can't test
> that.) We don't need to be bug-compatible to that extent.
>
> * What should happen when the subexpression/capture group number of
> regexp_instr or regexp_substr exceeds the number of parenthesized
> subexpressions of the regexp? Oracle silently returns a no-match
> result (0 or NULL), as does this patch. However, IBM's docs say
> that DB2 raises an error. I'm inclined to think that this is
> likewise taking bug-compatibility too far, and that we should
> raise an error like DB2. There are clearly cases where throwing
> an error would help debug a faulty call, while I'm less clear on
> a use-case where not throwing an error would be useful.
>
> * IBM's docs say that both regexp_count and regexp_like have
> arguments "string, pattern [, start] [, flags]" --- that is,
> each of start and flags can be independently specified or omitted.
> The patch follows Oracle, which has no start option for
> regexp_like, and where you can't write flags for regexp_count
> without writing start. This is fine by me, because doing these
> like DB2 would introduce the same which-argument-is-this issues
> as we're being forced to cope with for regexp_replace. I don't
> think we need to accept ambiguity in these cases too. But it's
> worth memorializing this decision in the thread.
>
> * The patch has most of these functions silently ignoring the 'g'
> flag, but I think they should raise errors instead. Oracle doesn't
> accept a 'g' flag for these, so why should we? The only case where
> that logic doesn't hold is regexp_replace, because depending on which
> syntax you use the 'g' flag might or might not be meaningful. So
> for regexp_replace, I'd vote for silently ignoring 'g' if the
> occurrence-number parameter is given, while honoring it if not.
>
> I've already made changes in my local copy per the last item,
> but I've not done anything about throwing errors for out-of-range
> subexpression numbers. Anybody have an opinion about that one?

I thought about this while I was implementing the functions and chose to
not throw an error because of the Oracle behavior and also with others
regular expression implementation. For example in Perl there is no error:

$ perl -e '$str="hello world"; $str =~ s/(l)/$20/; print "$str\n";'
helo world

Usually a regular expression is always tested by its creator to be sure
that this the right one and that it does what is expected. But I agree
that it could help the writer to debug its RE.

Also if I recall well Oracle and DB2 limit the number of capture groups
back references from \1 to \9 for Oracle and \0 to \9 for DB2. I have
chosen to not apply this limit, I don't see the interest of such a
limitation.

--
Gilles Darold
http://www.darold.net/

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2021-08-01 20:55:01 pgstat_send_connstats() introduces unnecessary timestamp and UDP overhead
Previous Message Gilles Darold 2021-08-01 19:22:41 Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace