[PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From: Gilles Darold <gilles(at)darold(dot)net>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Date: 2021-03-03 09:15:57
Message-ID: fc160ee0-c843-b024-29bb-97b5da61971f@darold.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Oracle:
https://docs.oracle.com/en/database/oracle/oracle-database/18/adfns/regexp.html#GUID-F14733F3-B943-4BAD-8489-F9704986386B
IBM:
https://www.ibm.com/support/producthub/db2/docs/content/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0061494.html?pos=2
Z/OS:
https://www.ibm.com/support/knowledgecenter/SSEPEK_12.0.0/sqlref/src/tpc/db2z_bif_regexplike.html
EDB:
https://www.enterprisedb.com/edb-docs/d/edb-postgres-advanced-server/reference/database-compatibility-for-oracle-developers-reference-guide/9.6/Database_Compatibility_for_Oracle_Developers_Reference_Guide.1.098.html

Hi,

I would like to suggest adding the $subject functions to PostgreSQL. We
can do lot of things using regexp_matches() and regexp_replace() but
some time it consist on building complex queries that these functions
can greatly simplify.

Look like all RDBMS that embedded a regexp engine implement these
functions (Oracle, DB2, MySQL, etc) but I don't know if they are part of
the SQL standard. Probably using regexp_matches() can be enough even if
it generates more complex statements but having these functions in
PostgreSQL could be useful for users and code coming from theses RDBMS.

  - REGEXP_COUNT( string text, pattern text, [, position int] [, flags
text ] ) -> integer

        Return the number of times a pattern occurs in a source string
after a certain position, default from beginning.

        It can be implemented in PostgreSQL as a subquery using:

            SELECT count(*) FROM regexp_matches('A1B2C3', '[A-Z][0-9]',
'g'); -> 3

        To support positioning we have to use substr(), for example
starting at position 2:

            SELECT count(*) FROM regexp_matches(substr('A1B2C3', 2),
'[A-Z][0-9]'); -> 2

        With regexp_count() we can simply use it like this:

            SELECT regexp_count('A1B2C3', '[A-Z][0-9]'); -> 3
            SELECT regexp_count('A1B2C3', '[A-Z][0-9]', 2); -> 2

  - REGEXP_INSTR( string text, pattern text, [, position int] [,
occurrence int] [, return_opt int ] [, flags text ] [, group int] ) ->
integer

        Return the position in a string for a regular expression
pattern. It returns an integer indicating the beginning or ending
position of the matched substring, depending on the value of the
return_opt argument (default beginning). If no match is found, then the
function returns 0.

            * position: indicates the character where the search should
begin.
            * occurrence: indicates which occurrence of pattern found in
string should be search.
            * return_opt: 0 mean returns the position of the first
character of the occurrence, 1 mean returns the position of the
character following the occurrence.
            * flags: regular expression modifiers.
            * group: indicates which subexpression in pattern is the
target of the function.

        Example:

            SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1,
0, 'i', 4); -> 7

        to obtain a PostgreSQL equivalent:

            SELECT position((SELECT (regexp_matches('1234567890',
'(123)(4(56)(78))', 'ig'))[4] offset 0 limit 1) IN '1234567890');

  - REGEXP_SUBSTR( string text, pattern text, [, position int] [,
occurrence int] [, flags text ] [, group int] ) -> text

        It is similar to regexp_instr(), but instead of returning the
position of the substring, it returns the substring itself.

        Example:

            SELECT regexp_substr('500 gilles''s street, 38000 Grenoble,
FR', ',[^,]+,'); -> , 38000 Grenoble,

        or with a more complex extraction:

            SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1,
'i', 4); -> 78
            SELECT regexp_substr('1234567890 1234557890',
'(123)(4(5[56])(78))', 1, 2, 'i', 3); -> 55

        To obtain the same result for the last example we have to use:

            SELECT (SELECT * FROM regexp_matches('1234567890
1234557890', '(123)(4(5[56])(78))',  'g') offset 1 limit 2)[3];

I have not implemented the regexp_like() function, it is quite similar
than the ~ and ~* operators except that it can also support other
modifiers than 'i'. I can implement it easily and add it to the patch if
we want to supports all those common functions.

  - REGEXP_LIKE( string text, pattern text, [, flags text ] ) -> boolean

        Similar to the LIKE condition, except that it performs regular
expression matching instead of the simple pattern matching performed by
LIKE.

        Example:

            SELECT * FROM t1 WHERE regexp_like(col1, '^d$', 'm');

        to obtain a PostgreSQL equivalent:

            SELECT * FROM t1 WHERE regexp_match (col1, '^d$', 'm' ) IS
NOT NULL;

There is also a possible extension to regexp_replace() that I have not
implemented yet because it need more work than the previous functions.

  - REGEXP_REPLACE( string text, pattern text, replace_string text, [,
position int] [, occurrence int] [, flags text ] )

    Extend PostgreSQL regexp_replace() by adding position and occurrence
capabilities.

The patch is ready for testing with documentation and regression tests.

Best regards,

--
Gilles Darold
LzLabs GmbH

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Gilles Darold 2021-03-03 09:22:16 Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Previous Message Daniel Gustafsson 2021-03-03 09:04:58 Re: pg_upgrade version checking questions