Re: Bizarre behavior of \w in a regular expression bracket construct

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joel Jacobson <joel(at)compiler(dot)org>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Bizarre behavior of \w in a regular expression bracket construct
Date: 2021-02-23 17:15:29
Message-ID: 3873654.1614100529@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
>> It looks like the interpretation of these other engines is that [\d-a]
>> is the set of \d, the literal character "-", and the literal character
>> "a". In other words, the - preceded by \d or \w (or any other character
>> class, I guess?) loses its special meaning of identifying a character
>> range.

> Yeah. While I can see the attraction of being picky about this,
> I can also see the attraction of being more compatible with other
> engines. Should we relax this?

After some more research I'm feeling that this would be a bad idea.
The POSIX spec states that such cases are unspecified, meaning that
implementations can do what they like. Hence Perl and JS are not
out of line to interpret it this way. However, XQuery and therefore
also SQL consider that a character class after a dash means character
set subtraction [1], which is pretty nearly the exact opposite
semantics. Keeping in mind that we are likely to someday want to
provide a closer match for XQuery, I'm thinking we're best off to
keep such cases as an error for now. Otherwise the risk of confusion
will be pretty high.
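
To make the conflict concrete, here is a minimal sketch of the two readings, modeled with plain Python sets rather than any real regex engine. The function names and the ASCII-only definitions of \d and \w are assumptions made for illustration, and the XQuery form is written in its bracketed spelling, [\w-[\d]].

```python
import string

# ASCII-only stand-ins for \d and \w (an assumption; real engines may
# use locale- or Unicode-aware definitions).
DIGITS = frozenset(string.digits)
WORD = frozenset(string.ascii_letters + string.digits + "_")


def union_reading(cls, literal):
    """Perl/JS-style reading of [<cls>-<literal>]: the class, '-', the literal."""
    return cls | {"-", literal}


def subtraction_reading(base, removed):
    """XQuery-style reading of [<base>-[<removed>]]: set difference."""
    return base - removed


perl_like = union_reading(DIGITS, "a")           # models [\d-a]
xquery_like = subtraction_reading(WORD, DIGITS)  # models [\w-[\d]]
```

Under the first reading the dash adds characters to the set; under the second it removes them, which is why silently accepting either one could surprise users expecting the other.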

Anyway, 0001 attached is the promised patch to enable \D, \S, \W
to work inside bracket expressions. I did some cleanup in the
general area as well:

* Create infrastructure to allow treating \w as a character class
in its own right. (I did not expose [[:word:]] as a class name,
though it would be a little more symmetric to do so; should we?)

* Split cclass() into separate functions to look up a char class
name (producing an enum) and to produce a cvec character vector
from the enum. This allows the char class escapes to use the
enum values directly without an artificial lookup.

* Remove the lexnest() hack, and in consequence clean up wordchrs()
to not interact with the lexer.

* Fix colorcomplement() to not be O(N^2) in the number of colors
involved. I didn't detect any measurable speedup on Joel's corpus,
but it seems like a good idea anyway.

* Get rid of useless-as-far-as-I-can-see calls of element()
on single-character character element names in brackpart().
element() always maps these to the character itself, and things
would be quite broken if it didn't --- should "[a]" match something
different than "a" does? Besides, the shortcut path in brackpart()
wasn't doing this anyway, making it even more inconsistent.
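
The cclass() split described above can be sketched roughly as follows. This is Python written only to illustrate the shape of the design, not the C code in the patch: every name here (CharClass, NAMES, lookup_cclass, cclass_chars, escape_chars) is hypothetical, and the character sets are restricted to ASCII.

```python
import string
from enum import Enum, auto


class CharClass(Enum):
    # Hypothetical subset of class identifiers; not the actual C enum.
    DIGIT = auto()
    SPACE = auto()
    WORD = auto()


# Step 1: look up a class name, as needed for [[:name:]] syntax.
# "word" is left out on purpose, mirroring the choice not to expose
# [[:word:]] as a class name.
NAMES = {"digit": CharClass.DIGIT, "space": CharClass.SPACE}


def lookup_cclass(name):
    return NAMES[name]  # raises KeyError for an unknown name


# Step 2: build the member set from the enum value.
def cclass_chars(cls):
    if cls is CharClass.DIGIT:
        return frozenset(string.digits)
    if cls is CharClass.SPACE:
        return frozenset(string.whitespace)
    return frozenset(string.ascii_letters + string.digits + "_")


# Escapes map straight to enum values, so no name lookup is needed.
ESCAPES = {"d": CharClass.DIGIT, "s": CharClass.SPACE, "w": CharClass.WORD}
ALPHABET = frozenset(chr(c) for c in range(128))


def escape_chars(letter):
    base = cclass_chars(ESCAPES[letter.lower()])
    # Uppercase escapes (\D, \S, \W) are the complement of the class.
    return ALPHABET - base if letter.isupper() else base
```

With the lookup and the set construction separated, \w can be a class in its own right without having a bracket-expression name.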

0001 preserves the current behavior of these constructs with
respect to newlines, namely that:

\s matches newline, with or without 'n' flag
\S doesn't match newline, with or without 'n' flag
\w doesn't match newline, with or without 'n' flag
\W matches newline, except with 'n' flag
\d doesn't match newline, with or without 'n' flag
\D matches newline, except with 'n' flag

Perl and JavaScript believe that \W and \D should match newlines
regardless of their 's' flag, so there's a case for changing
\W and \D to match newline regardless of our 'n' flag. 0002
attached is the quite trivial patch to do this. I'm not 100%
convinced that this is a good change to make, but if we're
going to do it, now would be the time.
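
As a rough cross-check, the sketch below writes the 0001 behavior listed above as a table, applies the 0002 change to it, and compares the result with Python's re module. Python is used here only as a convenient stand-in for a Perl-style engine, with re.DOTALL playing the role of Perl's 's' flag; the table and helper names are made up for this example.

```python
import re

ESCAPES = ["s", "S", "w", "W", "d", "D"]

# Behavior preserved by 0001, as listed in the message:
# (matches newline without 'n' flag, matches newline with 'n' flag)
BEHAVIOR_0001 = {
    "s": (True, True),
    "S": (False, False),
    "w": (False, False),
    "W": (True, False),
    "d": (False, False),
    "D": (True, False),
}

# 0002 as described: \W and \D match newline regardless of the flag.
BEHAVIOR_0002 = dict(BEHAVIOR_0001, W=(True, True), D=(True, True))


def python_matches_newline(letter, flags=0):
    """Ask Python's re whether the escape for `letter` matches a newline."""
    return re.fullmatch("\\" + letter, "\n", flags) is not None


# Python's answer without and with DOTALL; the flag makes no difference
# to these escapes.
python_behavior = {
    e: (python_matches_newline(e), python_matches_newline(e, re.DOTALL))
    for e in ESCAPES
}
```

In Python the two entries of each pair are always equal, and the resulting table coincides with BEHAVIOR_0002, which is the sense in which 0002 moves closer to the Perl-style engines.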

Thoughts?

regards, tom lane

[1] https://www.regular-expressions.info/charclasssubtract.html

Attachment                            Content-Type   Size
0001-rework-char-class-escapes.patch  text/x-diff    43.6 KB
0002-DW-always-match-newline.patch    text/x-diff    4.0 KB
