Bug concerning regular expressions and UTF-8

From: Helmar Spangenberg <hspangenberg(at)frey(dot)de>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Bug concerning regular expressions and UTF-8
Date: 2006-01-20 17:03:17
Message-ID: 200601201803.17706.hspangenberg@frey.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hello folks,

my system is a SuSE 10.0 Linux and a plain PostgreSQL 8.1.2 (compiled by
myself, NLS enabled). LOCALE is set to de_DE.UTF-8.

The bug shows up using the operator '~*' with umlauts. An easy way to produce
a faulty result is

select 'XXXMÜLLERYyyy' ~* '.*müller.*';

The result should be "TRUE", however Postgres thinks, it's "FALSE" (see also
discussion in www.pg-forum.de, subject "Konfiguration", thread "Umlaute bei
Regular Expressions"). It seems that this problem does not exist in Windows
based installations.

It seems to me that this bug is originated in the file
src/backend/regex/regc_locale.c. The functions pg_wc_tolower(pg_wchar) and
pg_wc_toupper(pg_wchar) rely on the C-functions toupper(unsigned char) and
tolower(unsigned char) which definitely are the wrong choice for UTF8
characters beyond the ASCII coding.

To check my estimation, I replaced the bodies of pg_wc_tolower and
pg_wc_toupper simply by "return towlower(c);" and "return towupper(c);",
which lead to the correct results of
select 'XXXMÜLLERYyyy' ~* '.*müller.*';

Since I don't have any idea concerning the side effects of this change, please
let me know as soon as an "official" patch is available - I definitely do
need regular expressions handling UTF8 correctly...

Thanks,
Helmar Spangenberg
e-mail: hspangenberg(at)frey(dot)de

Browse pgsql-bugs by date

  From Date Subject
Next Message John Jorgensen 2006-01-20 19:14:03 BUG #2192: misbehaving IRIX strtod() subverts parsing of "infinity"
Previous Message Robert Treat 2006-01-20 16:48:43 Re: improper estimates even with high statistic values