Re: Bizarre behavior of \w in a regular expression bracket construct

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Alvaro Herrera" <alvherre(at)alvh(dot)no-ip(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Bizarre behavior of \w in a regular expression bracket construct
Date: 2021-02-24 15:23:07
Message-ID: 6e8d0fd6-a9f1-4cd8-b9cf-3a5ab5d96313@www.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Tue, Feb 23, 2021, at 18:15, Tom Lane wrote:
>0001 preserves the current behavior of these constructs with
>respect to newlines, namely that:
>
>\s matches newline, with or without 'n' flag
>\S doesn't match newline, with or without 'n' flag
>\w doesn't match newline, with or without 'n' flag
>\W matches newline, except with 'n' flag
>\d doesn't match newline, with or without 'n' flag
>\D matches newline, except with 'n' flag
>
>Perl and Javascript believe that \W and \D should match newlines
>regardless of their 's' flag, so there's a case for changing
>\W and \D to match newline regardless of our 'n' flag. 0002
>attached is the quite trivial patch to do this. I'm not quite
>100% convinced whether this is a good change to make, but if we're
>going to do it now would be the time.
>
>Thoughts?

I've tested 4.4M different regex/subject pairs
against 0001 and 0001+0002 trying to find
some interesting examples to analyze:

SELECT COUNT(*) FROM regex_tests;
4468843

Out of these, 64783 (1.4%) contained \W
that could be processed by the regex engine
and that didn't produce an error:

CREATE TABLE "\W" AS SELECT * FROM regex_tests WHERE processed AND error_pg IS NULL AND pattern LIKE '%\\W%';
SELECT 64783

Out of these, 539 gave a different result
when comparing 0001 vs 0001+0002:

CREATE TABLE "\W diff" AS SELECT *, regexp_match(subject, '('||pattern||')', 'n') AS captured_pg_0001 FROM "\W" WHERE captured_pg IS DISTINCT FROM regexp_match(subject, '('||pattern||')', 'n');
SELECT 539

Out of these, 62 didn't contain any \W
when the special [\w\W] construct had been filtered out.

CREATE TABLE "\W diff ignore [\w\W]" AS SELECT * FROM "\W diff" WHERE regexp_replace(pattern,'\[\\w\\W\]','','g') LIKE '%\\W%';
SELECT 62

Out of these, here is a break-down showing number of distinct subjects per pattern:

SELECT COUNT(*), pattern FROM "\W diff ignore [\w\W]" GROUP BY 2 ORDER BY 1 DESC;
count | pattern
-------+--------------------------------------------------
47 | (?:^|\W+)@apply\s*\(?([^);\n]*)\)?
12 | \W
1 | ((?:^|}|,|;)\W*)((?:\w+)?\.(?:mc|mg|row)[\-\w]+)
1 | [\W\d]+
1 | \W*$
(5 rows)

Let's go through each case:

Pattern #1: (?:^|\W+)@apply\s*\(?([^);\n]*)\)?
====================================

This pattern is always used with the flags "gi".

Example subject:

font-family: var(--paper-font-common-base_-_font-family); -webkit-font-smoothing: var(--paper-font-common-base_-_-webkit-font-smoothing);
@apply --paper-font-common-nowrap;

If the author would have intended to only match non-word characters without newlines,
then these kind of subjects would only match by coincidence, since @apply in indented
using blank space, which is included in \W.

The \W+ in this example makes the regex match the ");" on the line before "@apply", which looks very odd.

My conclusion is the author in this example wrongly think \W+ means "at least one white space".

I therefore it would be an improvement in this case to always include newlines in \W.

Patch 0002 therefore gets +1 due to this example.

Pattern 2: \W
============

Flags used for this pattern (among all examples, not just the ones producing a diff):

SELECT flags, count FROM patterns WHERE pattern = '\W' ORDER BY 2 DESC;
flags | count
-------+-------
g | 2805
| 1476
gi | 39
y | 22
(4 rows)

All subjects for this pattern had some white-space in the beginning,
and all of them even have at least one new-line in the beginning:

SELECT length((regexp_match(subject,'^(\n*)'))[1]), COUNT(*) FROM "\W diff ignore [\w\W]" WHERE pattern = '\W' GROUP BY 1 ORDER BY 1;
length | count
--------+-------
1 | 9
2 | 1
3 | 2
(3 rows)

This, in combination with the popularity of the "g" flag with this pattern,
makes me think \W is used to strip away leading white-space,
including new-lines.

Patch 0002 therefore gets +1 due to this example.

Pattern 3: ((?:^|}|,|;)\W*)((?:\w+)?\.(?:mc|mg|row)[\-\w]+)
==============================================

Flags: g

Subject:

div.mgline:hover a.close-informer {
opacity: 0.7;
-moz-transition: all 0.3s ease-out;
-o-transition: all 0.3s ease-out;
-webkit-transition: all 0.3s ease-out;
-ms-transition: all 0.3s ease-out;
transition: all 0.3s ease-out;
}

To me it looks like the author wrongly thinks \W means "white space".

What makes me believe this is that \W* is in between

(?:^|}|,|;)

which matches end of statements, and,

(?:\w+)?\.

which matches a HTML-tag and CSS class name, or just a CSS class name.

The only natural thing I see could exist in between those two constructs is white space.

Normally this regex doesn't produce any difference for cases found,
since most CSS code has been minified where newlines are removed,
but the case above was not minified and produced a diff.

Patch 0002 therefore gets +1 due to this example.

Pattern 4: [\W\d]+
================

No flags for this pattern.

The case that caused a diff was a subject with just a single comma, followed by newline and then blank spaces.

Subject in hex: 2c 0a 09 09 09 09 09 09 09

This caused 0001 to only match the comma,
whereas 0002 (and Javascript/Perl) matches the blank spaces as well.

Here are some other subjects that don't necessarily cause a diff,
but that could hopefully makes us understand the intent of the regex:

SELECT DISTINCT ON (regexp_match_v8) * FROM (SELECT regexp_match_v8(subject,'[\W\d]+'), shrink_text(subject,40) FROM subjects WHERE pattern_id = 25935) AS x;
regexp_match_v8 | shrink_text
------------------------------------------------------------+-------------------------------------------------------------
{", +| , +
"} |
{", "} | ,
{.} | .col-item
{/} | /content/phonak/se/s ... 106 chars ... e.jpg, (largeretina)
{//} | //images.images4us.c ... 53 chars ... -481919.png, (large)
{://} | https://www.dilling. ... 55 chars ... .webp, (medium-only)
{3} | typo3conf/ext/rlp/Re ... 23 chars ... lp-logo.png, (large)
(7 rows)

We can see the diffing case on the first line, the one with comma and newlines+blank spaces.
No clue on what that one is, but looking at the rest,
to me it looks like they are trying to match the the non-word characters in the beginning.
The strange thing is why \d is included in the bracket expression.
This causes a different in the last example:

{3} | typo3conf/ext/rlp/Re ... 23 chars ... lp-logo.png, (large)

If \d would not have been included, the first "/" would be matched instead of the "3".

I cannot draw any conclusions for this pattern on what would be advisable,
except that in most cases for this pattern, it wouldn't make any difference to include
or not include newlines in \W.

Pattern 5: \W*$
==============

No flags for this pattern.

The subject is redacted due to being a promotional text for some cryptocurrency.
it's just four normal English sentences, where the last one is separated from the first three
with two newlines in between, rewritten:

"Example sentence. Some other sentence.

Yet some other sentence. "

Double-quotes added to show the trailing blank space in the last sentence.
Due to it, the 'n' regex flag causes the dot and newline to match with the 0002 patch,
but only match the dot without the 0002 patch.

In Javascript/Perl, since $ only means end-of-string there (unless using the "m" flag),
they instead match the last blank space. 0002 would give the same behaviour without the "n" flag.

My conclusion is \W*$ is typically wrongly used to remove trailing white-space.

Always including newlines in \W would be an improvement here,
since otherwise newlines wouldn't be stripped.

Patch 0002 therefore gets +1 due to this example.

======END OF PATTERNS=====

Final conclusion:

Out of the 5 patterns analyzed,
I found 4 of them would benefit from including newlines in \W.

The risk of changing this seems rather small,
since only 0.01% of the cases found produced
any difference at all (539 out of 4468843),
and out of these cases, most only contained
the obvious [\w\W] which greatly benefits,
and the rest of the 62 cases have now been
manually verified to also benefit from a change.

My opinion is therefore we should change \W to include newlines.

I will hopefully be able to provide a similar analysis of \D soon,
but wanted to send this in the meantime.

/Joel

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jan Wieck 2021-02-24 15:46:54 Re: Extensibility of the PostgreSQL wire protocol
Previous Message Dilip Kumar 2021-02-24 14:51:32 Re: Is Recovery actually paused?