Re: Bizarre behavior of \w in a regular expression bracket construct

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Joel Jacobson" <joel(at)compiler(dot)org>
Cc: "Alvaro Herrera" <alvherre(at)alvh(dot)no-ip(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Bizarre behavior of \w in a regular expression bracket construct
Date: 2021-02-24 17:09:02
Message-ID: 4099447.1614186542@sss.pgh.pa.us
Lists: pgsql-hackers

"Joel Jacobson" <joel(at)compiler(dot)org> writes:
> On Tue, Feb 23, 2021, at 18:15, Tom Lane wrote:
>> Perl and Javascript believe that \W and \D should match newlines
>> regardless of their 's' flag, so there's a case for changing
>> \W and \D to match newline regardless of our 'n' flag. 0002
>> attached is the quite trivial patch to do this. I'm not quite
>> 100% convinced whether this is a good change to make, but if we're
>> going to do it now would be the time.

> [ extensive analysis ]
> My opinion is therefore we should change \W to include newlines.

Wow, thanks for doing all that work! But OTOH, looking at a
corpus taken from Javascript practice seems like it'd inevitably
lead to that conclusion, since that is what \W does in Javascript.
Whether the regex authors knew the exact rules or not (and I share
your suspicions that some of them didn't), if they'd done any
testing they'd have been led to write their code that way.

Still, I am not convinced that there's much to justify our current
definition either. Looking at the existing code shows that the way
\W and \D work now was forced by Spencer's decision to make 'n' mode
affect complemented character classes in general, since they're just
macros for complemented character classes. With this reimplementation,
that connection isn't there anymore, so we can change it if we like.
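
To make the coupling concrete, here is a toy model in Python. It is only a sketch under stated assumptions: the function names are made up for illustration, nothing here is taken from the regex engine's source, and "word character" is approximated as alphanumeric-or-underscore. It shows why a \W defined as a complemented class inherits the 'n' mode newline rule, and what changes once it is defined on its own.

```python
# Toy model only: hypothetical names, not the PostgreSQL regex engine's code.

def is_word_char(ch: str) -> bool:
    # Approximation of a "word" character: alphanumeric or underscore.
    return ch.isalnum() or ch == "_"

def complemented_class_matches(ch, in_class, newline_sensitive):
    # Model of a bracket expression [^...]: in newline-sensitive ('n') mode,
    # a complemented class never matches newline.
    if newline_sensitive and ch == "\n":
        return False
    return not in_class(ch)

def old_W(ch, newline_sensitive):
    # \W as a macro for a complemented class, so it inherits the 'n' rule.
    return complemented_class_matches(ch, is_word_char, newline_sensitive)

def new_W(ch, newline_sensitive):
    # \W defined independently of bracket expressions: the flag is ignored,
    # which is the behavior the 0002 patch proposes.
    return not is_word_char(ch)
```

In this model old_W("\n", True) is false while new_W("\n", True) is true; the two agree whenever 'n' mode is off.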

Since (AFAICS) the main use of 'n' mode is to make our regexes work
more like these other products, bringing \W and \D into line with
them seems like a reasonable thing to do.
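
As a quick illustration of the behavior being matched, the sketch below uses Python's re module as a stand-in. Python is not one of the engines named above, but its re module treats this case the same way: the DOTALL flag (the counterpart of an 's' flag) changes whether "." matches a newline, while \W and \D match a newline either way. The helper name `matches` is made up for the example.

```python
import re

def matches(pattern: str, text: str, flags: int = 0) -> bool:
    # True if the pattern matches anywhere in text.
    return re.search(pattern, text, flags) is not None

newline = "\n"

# "." depends on the flag ...
dot_default = matches(r".", newline)
dot_dotall = matches(r".", newline, re.DOTALL)

# ... but \W and \D do not.
W_default = matches(r"\W", newline)
W_dotall = matches(r"\W", newline, re.DOTALL)
D_default = matches(r"\D", newline)
D_dotall = matches(r"\D", newline, re.DOTALL)

print(dot_default, dot_dotall)  # flag-dependent
print(W_default, W_dotall)      # flag-independent
print(D_default, D_dotall)      # flag-independent
```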

I've also decided after reflection that the patch should indeed
create a named "word" character class. That's allowed per POSIX,
and it simplifies some aspects of the documentation, since we can
rely on referencing the class instead of repeating ourselves.
The attached 0001 v2 does that; it's otherwise the same as before.

Speaking of documentation, I'm wondering more and more why we're
continuing to carry along re_syntax.n. We don't expose that to
users in any way, and it has not been maintained nearly as faithfully
as the SGML docs. (Looking at the git history, I think I included
it in 7bcc6d98f because it replaced re_format.7, which had been there
in that directory since Postgres95. But that history is immaterial
now that we've got proper user-facing documentation.)

regards, tom lane

#text/x-diff; name="0001-rework-char-class-escapes-2.patch" [0001-rework-char-class-escapes-2.patch] /home/tgl/pgsql/0001-rework-char-class-escapes-2.patch
#text/x-diff; name="0002-DW-always-match-newline.patch" [0002-DW-always-match-newline.patch] /home/tgl/pgsql/0002-DW-always-match-newline.patch
