Skip site navigation (1) Skip section navigation (2)

Bug concerning regular expressions and UTF-8

From: Helmar Spangenberg <hspangenberg(at)frey(dot)de>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Bug concerning regular expressions and UTF-8
Date: 2006-01-20 17:03:17
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-bugs
Hello folks,

my system is a SuSE 10.0 Linux and a plain PostgreSQL 8.1.2 (compiled by 
myself, NLS enabled). LOCALE is set to de_DE.UTF-8.

The bug shows up using the operator '~*' with umlauts. An easy way to produce 
a faulty result is

select 'XXXMÜLLERYyyy' ~* '.*müller.*'; 

The result should be "TRUE", however Postgres thinks, it's "FALSE" (see also 
discussion in, subject "Konfiguration", thread "Umlaute bei 
Regular Expressions"). It seems that this problem does not exist in Windows 
based installations.

It seems to me that this bug is originated in the file 
src/backend/regex/regc_locale.c. The functions pg_wc_tolower(pg_wchar) and 
pg_wc_toupper(pg_wchar) rely on the C-functions toupper(unsigned char) and 
tolower(unsigned char) which definitely are the wrong choice for UTF8 
characters beyond the ASCII coding.

To check my estimation, I replaced the bodies of pg_wc_tolower and 
pg_wc_toupper simply by "return towlower(c);" and "return towupper(c);", 
which lead to the correct results of 
select 'XXXMÜLLERYyyy' ~* '.*müller.*';

Since I don't have any idea concerning the side effects of this change, please 
let me know as soon as an "official" patch is available - I definitely do 
need regular expressions handling UTF8 correctly...

Helmar Spangenberg
e-mail: hspangenberg(at)frey(dot)de

pgsql-bugs by date

Next:From: John JorgensenDate: 2006-01-20 19:14:03
Subject: BUG #2192: misbehaving IRIX strtod() subverts parsing of "infinity"
Previous:From: Robert TreatDate: 2006-01-20 16:48:43
Subject: Re: improper estimates even with high statistic values

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group