Re: Bizarre behavior of \w in a regular expression bracket construct

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Bizarre behavior of \w in a regular expression bracket construct
Date: 2021-02-24 16:03:39
Message-ID: 04845093-8b64-4293-87a8-42294b5493ac@www.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 24, 2021, at 16:23, I wrote:
>I will hopefully be able to provide a similar analysis of \D soon,
>but wanted to send this in the meantime.

CREATE TABLE "\D" AS SELECT * FROM regex_tests WHERE processed AND error_pg IS NULL AND pattern LIKE '%\\D%';
SELECT 67558

CREATE TABLE "\D diff" AS SELECT *, regexp_match(subject, '('||pattern||')', 'n') AS captured_pg_0001 FROM "\D" WHERE captured_pg IS DISTINCT FROM regexp_match(subject, '('||pattern||')', 'n');
SELECT 12

SELECT COUNT(*), pattern FROM "\D diff" GROUP BY 2 ORDER BY 1 DESC;
count | pattern
-------+----------
11 | \D
1 | [\D|\d]*
(2 rows)

Pattern 1: \D
============

This pattern is used to find the first decimal separator, normally dot (.):

SELECT subject FROM regex_tests WHERE pattern = '\D' ORDER BY RANDOM() LIMIT 10;
subject
---------------------------
1.11.00.24975645674952163
1.11.30.6944442955860683
1.12.40.38502468714280424
3.5.10.9407443094500285
1.12.40.34334381021879845
2.0.20.5175496920692813
1.8.30.09144561055484002
3.4.10.6083619758942858
3.5.10.15406771889459425
2.0.00.6309370335082272
(10 rows)

We can see how this works in almost all cases:

SELECT captured_pg, captured_v8, count(*) from regex_tests where pattern = '\D' GROUP BY 1,2 ORDER BY 3 DESC LIMIT 3;
captured_pg | captured_v8 | count
-------------+-------------+-------
{.} | {.} | 66797
| | 103
{-} | {-} | 64
(10 rows)

If we take a look at the diffs found,
all such cases have a subjects that starts with newlines:

SELECT COUNT(*), subject ~ '^\n' AS starts_with_newline FROM "\D diff" WHERE pattern = '\D' GROUP BY 2;
count | starts_with_newline
-------+---------------------
11 | t
(1 row)

Naturally, if newlines are not included, then something else will match instead.

Now, if in these cases, ignoring the newline(s) and instead proceeding
to match the first non-digit non-newline, maybe we wound find a dot (.)
like in the normal case? No, that is not the case. Instead, we will hit
some arbitrary blank space or tab:

SELECT convert_to(captured_pg[1],'utf8') AS "0001+0002", convert_to(captured_pg_0001[1],'utf8') AS "0001", COUNT(*) FROM "\D diff" WHERE pattern = '\D' GROUP BY 1,2;
0001+0002 | 0001 | count
-----------+------+-------
\x0a | \x09 | 3
\x0a | \x20 | 7
\x0a | | 1
(3 rows)

The last example where nothing at all matched, was due to the string only contained a single newline,
which couldn't be matched.

None of these outliners contain any decimal-looking-digit-sequences at all,
it's all just white space, one "€ EUR" text and some text that looks like
it's coming from some web shop's title:

SELECT ROW_NUMBER() OVER (), subject FROM "\D diff" WHERE pattern = '\D';
row_number | subject
------------+----------------------------------------------------------------
1 | +
| +
| +
|
2 | +
|
3 | +
|
4 | +
|
5 | +
| € EUR +
|
6 | +
|
7 | +
|
8 | +
|
9 | +
|
10 | +
| Dunjackor, duntäcken och dunkuddar | Joutsen Dunspecialist+
| +
| +
| +
| – Joutsen Sweden +
| +
|
11 | +
|
(11 rows)

My conclusion is all of these are nonsensical subjects when applied to the \D regex.

Out of the subjects with actual digit-sequences,
none of them starts with newlines,
so including newlines in \D wouldn't cause any effect.

I see no benefit, but also no harm, in including newlines.

Pattern 2: [\D|\d]*
===============

This looks similar to [\w\W], the author has probably not understood pipe ("|") is not needed in between bracket expression parts. The author's intention is probably to match everything in the string, like .*, but including newlines.

Patch 0002 therefore gets +1 due to this example.

===END OF PATTERNS===

My final conclusion is we should always include newlines in \D.

/Joel

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexey Lesovsky 2021-02-24 16:15:14 Re: Asynchronous and "direct" IO support for PostgreSQL.
Previous Message Jan Wieck 2021-02-24 15:46:54 Re: Extensibility of the PostgreSQL wire protocol