Re: Assert failure with ICU support

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Richard Guo <guofenglinux(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Assert failure with ICU support
Date: 2023-04-19 15:42:20
Message-ID: 2983095.1681918940@sss.pgh.pa.us
Lists: pgsql-bugs

Richard Guo <guofenglinux(at)gmail(dot)com> writes:
> I happened to run into an assert failure by chance with ICU support.
> Here is the query:

> SELECT '1' SIMILAR TO '\൧';

> The failure happens in lexescape(),

> default:
>     assert(iscalpha(c));
>     FAILW(REG_EESCAPE);    /* unknown alphabetic escape */
>     break;

> Without ICU support, the same query just gives an error.

Interesting.

> FWIW, I googled a bit and '൧' seems to be number 1 in Malayalam.

The code in lexescape() is assuming that if "c" passes
iscalnum(), then either it's '0'-'9' or it passes iscalpha().
This is clearly wrong in Unicode-land, which has non-ASCII digits.
I imagine you can find libc locales where this fails, not only ICU.
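
A minimal standalone sketch of how that assumption breaks. The predicates here are made-up stand-ins (toy_isalnum, toy_isalpha), not the regex code's real iscalnum()/iscalpha(); they only know ASCII plus U+0D67, the Malayalam digit from the report:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for locale-aware predicates; NOT the real
 * iscalnum()/iscalpha().  They recognize ASCII plus one non-ASCII digit. */
#define MALAYALAM_DIGIT_ONE 0x0D67u

static bool
toy_isalpha(unsigned int c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

static bool
toy_isalnum(unsigned int c)
{
    return toy_isalpha(c) ||
        (c >= '0' && c <= '9') ||
        c == MALAYALAM_DIGIT_ONE;
}

/* True when c is alphanumeric but is neither '0'-'9' nor alphabetic,
 * which is the case the assumption treats as impossible. */
static bool
breaks_assumption(unsigned int c)
{
    return toy_isalnum(c) &&
        !(c >= '0' && c <= '9') &&
        !toy_isalpha(c);
}
```

With these stand-ins, breaks_assumption() is false for every ASCII character and true for U+0D67, which mirrors the path that reaches the assert in the default: branch.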

I think the question here is what we want to do with such cases:
throw a regex syntax error, or just return the character as-is?
The fine manual says that if the character after '\' is
alphanumeric, it's an escape, and otherwise the character is
quoted literally. But how shall we interpret "alphanumeric"?

I'm kind of inclined to the idea that anything that's not ASCII
should be considered to be literally quoted by '\', rather than
being an erroneous regex escape. Maybe I'm too English-centric.
But I don't like the idea that what is a valid regex should vary
depending on locale.
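
A rough sketch of that rule, to make the proposal concrete. This is not a patch against lexescape(); the enum and function names are invented for illustration, and the only decision shown is "escape or literal" for the character following the backslash:

```c
/* Hypothetical names, for illustration only. */
typedef enum
{
    ESC_LITERAL,    /* '\' just quotes the character */
    ESC_SEQUENCE    /* '\' introduces an escape (valid or not) */
} escape_kind;

static escape_kind
classify_after_backslash(unsigned int c)
{
    /* Anything outside ASCII is quoted literally, whatever the locale says. */
    if (c > 0x7F)
        return ESC_LITERAL;

    /* ASCII letters and digits are escapes; unknown ones can still be
     * rejected later as a regex syntax error. */
    if ((c >= '0' && c <= '9') ||
        (c >= 'A' && c <= 'Z') ||
        (c >= 'a' && c <= 'z'))
        return ESC_SEQUENCE;

    /* ASCII punctuation and the like are quoted literally, as before. */
    return ESC_LITERAL;
}
```

Because the test uses only fixed ASCII ranges, the result for a given pattern does not depend on the locale or on whether ICU is in use.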

regards, tom lane
