Re: Bizarre behavior of \w in a regular expression bracket construct

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Bizarre behavior of \w in a regular expression bracket construct
Date: 2021-02-21 07:13:11
Message-ID: c0f3c3b4-89b2-49ba-ba66-c4462dbc4da8@www.fastmail.com
Lists: pgsql-hackers

On Sat, Feb 20, 2021, at 23:20, Tom Lane wrote:
>I have a patch in progress that gets rid of the hokey macro
>expansion implementation of \w and friends, and I noticed
>this issue because it started to reject "[\w-_]", which our
>existing code accepts. There's a bunch of examples like that
>in Joel's Javascript regex corpus. I suspect that Javascript
>is reading such cases as "\w plus the literal characters '-'
>and '_'", but I'm not 100% sure of that.

In an attempt to demystify how \w works in various regex engines,
I created a test that deduces the matching ranges for a given bracket expression.

In ASCII mode, it simply tries every character code from 1 to 255:

regex | engine | deduced_ranges
------------+--------+-------------------------------
^([a-z])$ | pg | [a-z]
^([a-z])$ | pl | [a-z]
^([a-z])$ | v8 | [a-z]
^([\d-a])$ | pg |
^([\d-a])$ | pl | [-0-9a]
^([\d-a])$ | v8 | [-0-9a]
^([\w-;])$ | pg |
^([\w-;])$ | pl | [-0-9;A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
^([\w-;])$ | v8 | [-0-9;A-Z_a-z]
^([\w-_])$ | pg | [0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
^([\w-_])$ | pl | [-0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
^([\w-_])$ | v8 | [-0-9A-Z_a-z]
^([\w])$ | pg | [0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
^([\w])$ | pl | [0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
^([\w])$ | v8 | [0-9A-Z_a-z]
^([\W])$ | pg |
^([\W])$ | pl | [\x01-/:-@[-^`{-©«-´¶-¹»-¿×÷]
^([\W])$ | v8 | [\x01-/:-@[-^`{-ÿ]
^([\w-a])$ | pg | [0-9A-Z_-zªµºÀ-ÖØ-öø-ÿ]
^([\w-a])$ | pl | [-0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
^([\w-a])$ | v8 | [-0-9A-Z_a-z]
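
The ASCII-mode procedure described above can be sketched roughly as follows. This is a minimal illustration in Python, not the attached brute_matches.sql; the function names are hypothetical, and Python's re module is yet another engine with its own rules, so it should not be expected to reproduce every row of the table.

```python
import re

def matching_codepoints(pattern, lo=1, hi=255, flags=0):
    """Return the code points in [lo, hi] whose single character matches pattern."""
    rx = re.compile(pattern, flags)
    return [cp for cp in range(lo, hi + 1) if rx.fullmatch(chr(cp))]

def collapse_ranges(codepoints):
    """Collapse sorted code points into inclusive (start, end) runs."""
    ranges = []
    for cp in codepoints:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)   # extend the current run
        else:
            ranges.append((cp, cp))            # start a new run
    return ranges

def format_ranges(ranges):
    """Render runs in bracket-expression style, e.g. [0-9A-Z_a-z]."""
    parts = []
    for start, end in ranges:
        if start == end:
            parts.append(chr(start))
        else:
            parts.append(chr(start) + "-" + chr(end))
    return "[" + "".join(parts) + "]"

def deduce_ranges(pattern, flags=0):
    """Brute-force the set of single characters (codes 1..255) a pattern accepts."""
    return format_ranges(collapse_ranges(matching_codepoints(pattern, flags=flags)))

# re.ASCII restricts \w to [a-zA-Z0-9_] in Python.
print(deduce_ranges(r"^([a-z])$", re.ASCII))   # [a-z]
print(deduce_ranges(r"^([\w])$", re.ASCII))    # [0-9A-Z_a-z]
```

Dropping the re.ASCII flag makes Python use its Unicode definition of \w, which gives a wider result for the same pattern.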

In UTF8 mode, it generates 10000 random valid UTF-8 byte sequences and converts them to text.
This will of course leave a lot of gaps, but it gives an idea of what ranges there are.

regex | engine | deduced_ranges
------------+--------+---------------------------------------------------------------------------------------------------------------------------------------------------------------
^([a-z])$ | pg | [a-z]
^([a-z])$ | pl | [a-z]
^([a-z])$ | v8 | [a-z]
^([\d-a])$ | pg | ERROR
^([\d-a])$ | pl | [-0-9a٤-٦٩۲-۴۶۸-۹߀-߃߅߇०२४९১-২৫-৬੦੩-੪੬੯૧૭ ... 5 chars ... -୯௧௩௫౦౩౯೧೩-೫೯൧൪-൯෧෫໐໖-໗໙༡-༢២᠔᥎᪁᮵258-9𐒨𝟿]
^([\d-a])$ | v8 | [-0-9a]
^([\w-;])$ | pg | ERROR
^([\w-;])$ | pl | [-0-9;A-Z_a-zÀÂÆÌÎ-ÐÔÙ-Úß-áéëîñó-õø-ùûýÿ ... 3901 chars ... 𭈞𭈢𭈴𭑇𭒐𭕵𭖋𭙋𭟞𭢘𭥋𭥬𭧊𭧝𭫘𭯙𭯟𭶾𭷵𭸴𭹊𭻚𭼁𭽁𭾠𮄖𮅮𮉵𮏲𮕙𮛣𮝎𮣂𮥑𮪨忹殺灊鏹]
^([\w-;])$ | v8 | [-0-9;A-Z_a-z]
^([\w-_])$ | pg | [0-9A-Z_a-zªÁÆ-ÇÍ-ÒÔÙ-ÚÜÞáä-æèë-ìî-ïñõùý ... 3704 chars ... 𭍱𭓆𭓡𭕆𭖋𭖮𭘤𭙬𭣯𭦞𭬍𭭈𭲌𭶓𭶶𭷻𭹣𭹩𭼪𭾘𭿡𮄄𮄿𮆟𮆢𮇴𮋬𮍠𮏕𮒹𮜒𮝒𮡺𮦐𮨲𮩣𡛪韠𪊑]
^([\w-_])$ | pl | [-0-9A-Z_a-zªµÀ-ÁÅÈÊÑÓÕ-ÖØÚà-áã-æê-ìîð-ó ... 3884 chars ... 𭙐𭙥𭛏𭜆𭝃𭞗𭟺𭠼𭥮𭧕𭧙𭫢𭯛𭲠𭷱𭸡𭾉𮁣𮃦𮄫𮈔𮉞𮊀𮑳𮕝𮘊𮘚𮛍𮣝𮧕𮩺𮪇𮬊𮬡𡬘㩬茝鄛󠇂]
^([\w-_])$ | v8 | [-0-9A-Z_a-z]
^([\w])$ | pg | [0-9A-Z_a-zÃÇÉ-ÊÍ-ÎÐÒÖÙÛ-Þà-âåêî-ðò-ôöøú ... 3803 chars ... 𭏟-𭏠𭗷𭘱𭚆𭛿𭝵𭡓𭢕𭩪𭬞𭭆𭭾𭮺𭯌𭰅𭱇𭲩𭶧𭷡𭹿𭺟𮀑𮆔𮇩𮇰𮈯𮋷𮌜𮌨𮞄-𮞅𮩧𮫷𮬕𮮿舁]
^([\w])$ | pl | [0-9A-Z_a-zºÁÄ-ÆÉÍ-ÎÐÓ-ÔÖÙÛ-àâ-æéíð-ñø-ù ... 3881 chars ... 𭙗𭙳𭛨𭞌𭣘𭤁𭥖𭥜𭥷𭦋𭧺𭯊𭸘𭹍𭼷𭿰𮁵𮈅𮈇𮊩𮖛𮖹𮘠𮚞𮜞𮝀𮟟𮡖𮣝𮦖𮦘𮧏𮬅𮭁𮮟𮯓𦾱嶲󠇋]
^([\w])$ | v8 | [0-9A-Z_a-z]
^([\W])$ | pg | ERROR
^([\W])$ | pl | [\x01-/:-@[-^`{-\x7F\u0085-\u0089\u008B-\u008C\u008E-\u0092\u0098¥-§©«-¯±-²¸×˄-˅ ... 4264 chars ... 􏞢􏟆􏟐􏟘􏢄􏣢􏥭􏦡􏧎􏧰􏩤􏪃􏪠􏪵􏫎􏫤􏬌􏭇􏭴􏭷􏮩􏮷􏯭􏯴􏯾􏰬􏲡􏲾􏳧􏳵􏵡􏶾􏷤􏷫􏹶􏺷􏼁􏽷􏿵]
^([\W])$ | v8 | [\x01-/:-@[-^`{-\u0080\u0084\u0087\u008C\u008F\u0091\u0096\u009A -¡¥§ª-«®-°²-³µ¹¿ÁÄ ... 4855 chars ... -BGJLQT-Ubgkr-sy}「-」ェャスハホムᄀ-ᄁ좌￐ᅭᅵ￧↑￾]
^([\w-a])$ | pg | [0-9A-Z_-zªºÁ-ÃÇÌ-ÎÐ-ÑÔÖÝâ-ãå-æé-êìî-ñõü ... 3717 chars ... 𭝕𭟞𭡂𭡶𭤇𭥷𭦃𭧝𭮄-𭮅𭳐𭴁𭵦𭷥𭸍𭾙𭿘𮅕𮅳𮆈𮍪𮚝𮛶𮜠𮝁𮠦𮣆𮣼𮥴𮨨𮭘𮮛仌壮望-朡變]
^([\w-a])$ | pl | [-0-9A-Z_a-zºÁÃÇÉ-ÊÏÒ-ÔÖØÚ-ÛÞáäæí-ïõúü-ý ... 3854 chars ... 𭏇𭒧𭔃𭔽𭙟𭞽𭡖𭢮𭢱𭤙𭤶𭧝𭪁𭪻𭯰𭰭𭲟𭳚𭵊𭵽𭸷𭾏𮂗𮃴𮈄𮋝𮌫𮍏𮚅𮞞𮠾𮡊𮡿𮢐𮨍兤潮䏕𩅅]
^([\w-a])$ | v8 | [-0-9A-Z_a-z]
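
The sampling step for UTF8 mode might look something like the sketch below. How the original test picks its sequences is not stated, so drawing random code points and encoding them is an assumption here, as are the function names and the choice to drop duplicates.

```python
import random
import re

def random_utf8_chars(n, seed=0):
    """Return n distinct one-character strings built from valid UTF-8 sequences.

    Assumption: code points are drawn uniformly from 1..0x10FFFF, skipping
    the surrogate block, which has no valid UTF-8 encoding.
    """
    rng = random.Random(seed)
    chars = set()
    while len(chars) < n:
        cp = rng.randint(1, 0x10FFFF)
        if 0xD800 <= cp <= 0xDFFF:
            continue
        raw = chr(cp).encode("utf-8")      # 1 to 4 bytes, always valid UTF-8
        chars.add(raw.decode("utf-8"))     # byte sequence converted back to text
    return sorted(chars)

def matching_chars(pattern, sample, flags=0):
    """Return the sampled characters that the pattern accepts."""
    rx = re.compile(pattern, flags)
    return [ch for ch in sample if rx.fullmatch(ch)]

sample = random_utf8_chars(10000)
unicode_hits = matching_chars(r"^([\w])$", sample)
ascii_hits = matching_chars(r"^([\w])$", sample, re.ASCII)
print(len(sample), len(unicode_hits), len(ascii_hits))
```

Because only a small fraction of the code space is sampled, the matched characters are scattered, which is why the deduced ranges above are so fragmented.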

pg=PostgreSQL
pl=Perl
v8=Javascript

I think the use of \w and \W should be considered an anti-pattern when writing regexes, in any language,
because popular engines visibly disagree on what they match. It will never be obvious to either the reader
or the writer of the regex what was meant or what it means.
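
As a small illustration of the point, here is a hedged sketch in Python (an engine not covered by the test above) showing \w changing meaning with a flag, next to an explicit bracket expression that says exactly what it accepts. The helper name classify is hypothetical.

```python
import re

# An explicit class states its intent; \w depends on the engine and its mode.
EXPLICIT_WORD = r"[0-9A-Za-z_]"

def classify(ch):
    """Return (unicode \\w, ASCII \\w, explicit class) match results for one character."""
    return (
        bool(re.fullmatch(r"\w", ch)),            # Python default: Unicode-aware
        bool(re.fullmatch(r"\w", ch, re.ASCII)),  # ASCII-only \w
        bool(re.fullmatch(EXPLICIT_WORD, ch)),    # spelled-out character class
    )

for ch in ["a", "_", "é", "-"]:
    print(repr(ch), classify(ch))
```

For "é" the first result differs from the other two, while the explicit class behaves the same regardless of flags.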

/Joel

Attachment Content-Type Size
brute_matches.sql application/octet-stream 3.6 KB
