Regexp match with accented character problem

From: Laslo Forro <getforum(at)gmail(dot)com>
To: pgsql-novice(at)postgresql(dot)org
Subject: Regexp match with accented character problem
Date: 2010-06-08 08:48:53
Message-ID: AANLkTinhs32woCPg8neaTb3jEqde7BRHx0P0_rxgo_0_@mail.gmail.com
Lists: pgsql-novice

Hi there, could someone give me a hint about why the queries below behave the way they do?

The table:

test=# select * from texts;
     title      |           a_text
----------------+----------------------------
 A macskacicó   | A blah blah macskacicónak.
 The dark tower | Blah blah
(2 rows)

Now, I want to match 'macskacicó' as a whole word.

This works:
test=# select * from texts where title ~* E'macskacicó';
    title     |           a_text
--------------+----------------------------
 A macskacicó | A blah blah macskacicónak.
(1 row)

But it also matches the string 'macskacicónak':

test=# select * from texts where a_text ~* E'macskacicó';
    title     |           a_text
--------------+----------------------------
 A macskacicó | A blah blah macskacicónak.
(1 row)

Now, these do not work:

test=# select * from texts where title ~* E'\\mmacskacicó\\M';
test=# select * from texts where title ~* E'\\<macskacicó\\>';
test=# select * from texts where title ~* E'\\Wmacskacicó\\W';

(They fail with a single backslash as well as with a double one.)

Everything seems to be fine if the string does not end with an accented
character:
test=# select * from texts where title ~* E'\\mtower\\M';
     title      |  a_text
----------------+-----------
 The dark tower | Blah blah
(1 row)

It seems that accented characters are not recognized as \w characters.
(This query does match: select * from texts where title ~* E'\\Wmacskacic\\W'; )
Does this mean that I have to convert each accented character to its hex
form and pass it that way? Or is there a more elegant way to redefine the \w
class?
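
One way to test this hypothesis directly is to ask the regex engine how it
classifies a single character. If 'ó' turns out not to count as a word
character, the boundaries can be spelled out by hand instead of relying on
\m, \M or \W. The following is only a sketch under assumptions: the explicit
list of Hungarian accented letters is an illustrative addition, not a built-in
class, and the second query reuses the texts table shown above.

```sql
-- Does the regex engine treat the accented letter as a word character?
-- Compare the result for 'ó' with the one for plain ASCII 'o'.
SELECT 'ó' ~ E'\\w' AS accented_is_word_char,
       'o' ~ E'\\w' AS ascii_is_word_char;

-- Possible workaround (assumption: hand-written word boundaries).
-- A "boundary" here is the start or end of the string, or any character
-- that is neither alphanumeric nor one of the listed accented letters.
SELECT *
FROM texts
WHERE a_text ~* E'(^|[^[:alnum:]áéíóöőúüű])macskacicó($|[^[:alnum:]áéíóöőúüű])';
```

Unlike \W on both sides, which has to consume an actual character, the
(^|...) and ($|...) alternatives also allow the word to sit at the very start
or end of the string, as it does in the title 'A macskacicó'. Uppercase
accented letters would have to be added to the list if they can occur next to
the word.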

Thanks a lot!

I use:
PostgreSQL 8.4.1 on Gentoo.
postgresql.conf:
max_connections = 100
shared_buffers = 1000 # min 16, at least max_connections*2, 8KB each
lc_messages = 'en_US.UTF-8' # locale for system error message strings
lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
lc_numeric = 'en_US.UTF-8' # locale for number formatting
lc_time = 'en_US.UTF-8' # locale for time formatting

'locale' gives:
LANG=hu_HU.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
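
The shell's locale output describes the environment of the current session,
not necessarily the settings the database itself was created with. As a
hedged sketch, the database's own encoding and character classification
locale can be read from the pg_database catalog (the datcollate and datctype
columns exist from 8.4 onward):

```sql
-- Show the encoding and locale settings of the database we are connected to.
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database
WHERE datname = current_database();
```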
