BUG #4044: Incorrect RegExp substring Output

From: "Rui Martins" <Rui(dot)Martins(at)pdmfc(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #4044: Incorrect RegExp substring Output
Date: 2008-03-18 18:46:20
Message-ID: 200803181846.m2IIkKwE078848@wwwmaster.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs


The following bug has been logged online:

Bug reference: 4044
Logged by: Rui Martins
Email address: Rui(dot)Martins(at)pdmfc(dot)com
PostgreSQL version: 8.3
Operating system: Win2000
Description: Incorrect RegExp substring Output
Details:

To setup for test, run these SQL commands

CREATE TABLE TEST_REGEXP
(
BedNo VARCHAR(20)
);

INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '123' );
INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '4325:1' );
INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '2464M' );
INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '5678M:2' );
INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '453L:1' );

From the following link
http://www.postgresql.org/docs/8.3/static/functions-matching.html

We can read the following text:

The substring function with two parameters, substring(string from pattern),
provides extraction of a substring that matches a POSIX regular expression
pattern. It returns null if there is no match, otherwise the portion of the
text that matched the pattern. But if the pattern contains any parentheses,
the portion of the text that matched the first parenthesized subexpression
(the one whose left parenthesis comes first) is returned. You can put
parentheses around the whole expression if you want to use parentheses
within it without triggering this exception. If you need parentheses in the
pattern before the subexpression you want to extract, see the non-capturing
parentheses described below.

-- -- -- -- --
No suppose we want to split up the "BedNo" column, into its parts
(DoorNumber, RoomSize and BedNumber)

SELECT BedNo,
SUBSTRING( BedNo FROM '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' )
AS DoorNumber,
SUBSTRING( BedNo FROM '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' )
AS RoomSize,
SUBSTRING( BedNo FROM '^[[:digit:]]+[a-zA-Z]*(:[[:digit:]]+)?$' ) AS
BedNumber
FROM TEST_REGEXP;

Or in it's other form

SELECT BedNo,
SUBSTRING( BedNo, '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' ) AS
DoorNumber,
SUBSTRING( BedNo, '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' ) AS
RoomSize,
SUBSTRING( BedNo, '^[[:digit:]]+[a-zA-Z]*(:[[:digit:]]+)?$' ) AS
BedNumber
FROM TEST_REGEXP;

Both will return the same result:

BedNo DoorNumber RoomSize BedNumber
123 123 123
4325:1 4325 :1
2464M 2464 M 2464M
5678M:2 5678 M :2
453L:1 453 L :1

NOTE: Tabs may note be seen correctly on a web interface.

This is clearly, not the expected result, in particular for BedNo IN (
'123', '2464M' )

The BedNumber returned in these cases is NOT logical!

However, the manual is NOT EXPLICIT in what happens, if the returned MATCHED
parentheses part is the equivalent of an empty string!

Although it states:
"But if the pattern contains any parentheses, the portion of the text that
matched the first parenthesized subexpression (the one whose left
parenthesis comes first) is returned."

Apparently, the function is returning the entire MATCHED string, instead of
just the parenthesized subexpression.

I would expect the result for BedNumber to be either NULL or the EMPTY
String, and the later seems more logical. But the documentation doesn't
state which should be returned!

Not withstanding, the expected result should be:

BedNo DoorNumber RoomSize BedNumber
123 123
4325:1 4325 :1
2464M 2464 M
5678M:2 5678 M :2
453L:1 453 L :1

NOTE: Tabs may note be seen correctly on a web interface.

-- Hack Note for this specific case !

For this specific case, we can do a hack, and change the RegExp for
BedNumber, like in the next SELECT.

SELECT BedNo,
SUBSTRING( BedNo FROM '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' )
AS DoorNumber,
SUBSTRING( BedNo FROM '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' )
AS RoomSize,
SUBSTRING( BedNo FROM '^[[:digit:]]+[a-zA-Z]*:([[:digit:]]+)?$' ) AS
BedNumber
FROM TEST_REGEXP;

or alternatively

SELECT BedNo,
SUBSTRING( BedNo, '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' ) AS
DoorNumber,
SUBSTRING( BedNo, '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' ) AS
RoomSize,
SUBSTRING( BedNo, '^[[:digit:]]+[a-zA-Z]*:([[:digit:]]+)?$' ) AS
BedNumber
FROM TEST_REGEXP

This will return NULL when there is no BedNumber, by forcing the regExp to
fail the match.

But this only works, because the format uses ":" in that specific location,
if there was no ":" anywhere, we would NOT get away so easily.

Hope to have bee of help, in finding this documentation issue and
implementation bug.

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Dave Page 2008-03-18 19:12:13 Re: BUG #4041: error in Application Stack Builder
Previous Message Alvaro Herrera 2008-03-18 18:24:57 Re: BUG #4043: Unrecognized node type: using plpgsql