Bizarre behavior of \w in a regular expression bracket construct

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: "Joel Jacobson" <joel(at)compiler(dot)org>
Subject: Bizarre behavior of \w in a regular expression bracket construct
Date: 2021-02-20 22:20:19
Message-ID: 3220564.1613859619@sss.pgh.pa.us
Lists: pgsql-hackers

Our documentation says specifically "A character class cannot be used
as an endpoint of a range." This should apply to the character class
shorthand escapes (\d and so on) too, and for the most part it does:

# select 'x' ~ '[\d-a]';
ERROR: invalid regular expression: invalid character range

However, certain combinations involving \w don't throw any error:

# select 'x' ~ '[\w-a]';
 ?column?
----------
 t
(1 row)

while others do:

# select 'x' ~ '[\w-;]';
ERROR: invalid regular expression: invalid character range

It turns out that what's happening here is that \w is being
macro-expanded into "[:alnum:]_" (see the brbackw[] constant
in regc_lex.c), so then we have

select 'x' ~ '[[:alnum:]_-a]';

and that's valid as long as '_' is not greater than the trailing
range bound. The fact that we're using REG_ERANGE for both
"range syntax botch" and "range start is greater than range
end" helps to mask the fact that the wrong thing is happening,
i.e. my last example above is giving the right error string
for the wrong reason.

I thought of changing the expansion to "_[:alnum:]" but of
course that just moves the problem around: then some cases
with \w after a dash would be accepted when they shouldn't be.
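
To make the two expansions concrete, here is a toy model in Python. It
is only a sketch and not the regex engine's actual code: the names
expand, tokenize, and check are hypothetical, and the only detail taken
from the source is that \w becomes "[:alnum:]_" inside brackets. It
reports the same code for a class used as a range endpoint and for a
range whose start is greater than its end, which is how one error
masks the other.

```python
# Toy model only: NOT PostgreSQL's regex code. All names are hypothetical.
CURRENT_W = "[:alnum:]_"   # what \w expands to inside brackets (brbackw[])
SWAPPED_W = "_[:alnum:]"   # the alternative ordering considered above


def expand(body, w_expansion):
    """Textually substitute \\w, the way a macro expansion would."""
    return body.replace("\\w", w_expansion)


def tokenize(body):
    """Split a bracket body into ('class', name) and ('char', c) items."""
    items, i = [], 0
    while i < len(body):
        if body.startswith("[:", i):
            end = body.index(":]", i)
            items.append(("class", body[i + 2:end]))
            i = end + 2
        else:
            items.append(("char", body[i]))
            i += 1
    return items


def check(body, w_expansion=CURRENT_W):
    """Return 'ok' or 'REG_ERANGE' for a simplified bracket body."""
    items = tokenize(expand(body, w_expansion))
    i = 0
    while i < len(items):
        is_range = i + 2 < len(items) and items[i + 1] == ("char", "-")
        if not is_range:
            i += 1
            continue
        (lo_kind, lo), (hi_kind, hi) = items[i], items[i + 2]
        if "class" in (lo_kind, hi_kind):
            return "REG_ERANGE"   # character class used as a range endpoint
        if lo > hi:
            return "REG_ERANGE"   # range start greater than range end
        i += 3
    return "ok"


for body in (r"\w-a", r"\w-;", r"\w-_", r";-\w"):
    print(body, check(body, CURRENT_W), check(body, SWAPPED_W))
```

In this model the current expansion accepts \w-a and \w-_ because the
range that actually gets checked is '_'-'a' or '_'-'_', while the
swapped expansion rejects those but accepts ;-\w instead.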

I have a patch in progress that gets rid of the hokey macro
expansion implementation of \w and friends, and I noticed
this issue because it started to reject "[\w-_]", which our
existing code accepts. There's a bunch of examples like that
in Joel's JavaScript regex corpus. I suspect that JavaScript
is reading such cases as "\w plus the literal characters '-'
and '_'", but I'm not 100% sure of that.

Anyway, I don't see any non-invasive way to fix this in the
back branches, and I'm not sure that anyone would appreciate
our changing it in stable branches anyway. But I wanted to
document the issue for the record.

regards, tom lane
