Regexp match with accented character problem

From:	Laslo Forro <getforum(at)gmail(dot)com>
To:	pgsql-novice(at)postgresql(dot)org
Subject:	Regexp match with accented character problem
Date:	2010-06-08 08:48:53
Message-ID:	AANLkTinhs32woCPg8neaTb3jEqde7BRHx0P0_rxgo_0_@mail.gmail.com
Lists:	pgsql-novice

Hi there, could someone give me a hint as to why the following happens?

The table:

test=# select * from texts;
title | a_text
--------------+-------------------------
A macskacicó | A blah blah macskacicónak.
The dark tower | Blah blah
(2 rows)

Now, I want to match 'macskacicó' as a whole word.

A plain match works:
test=# select * from texts where title ~* E'macskacicó';
title | a_text
--------------+-------------------------
A macskacicó | A blah blah macskacicónak.
(1 row)

But it also matches the string 'macskacicónak':

test=# select * from texts where a_text ~* E'macskacicó';
title | a_text
--------------+----------------------------
A macskacicó | A blah blah macskacicónak.
(1 row)

Now, these do not work:

test=# select * from texts where title ~* E'\\mmacskacicó\\M';
test=# select * from texts where title ~* E'\\<macskacicó\\>';
test=# select * from texts where title ~* E'\\Wmacskacicó\\W';

(Neither with a single \ nor with a double \\.)

Now, it seems that all is ok if the string does not end with an accented
character:
test=# select * from texts where title ~* E'\\mtower\\M';
title | a_text
----------------+-----------
The dark tower | Blah blah
(1 row)

It seems that accented characters are not recognized as \w. (This one does
match: select * from texts where title ~* E'\\Wmacskacic\\W'; so the ó is
apparently treated as a non-word character.)
Does it mean that I have to convert each accented character to a hex form
and feed it that way? Or is there a more elegant way to redefine the \w
class?
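
To narrow this down, it may help to ask the regex engine directly how it classifies 'ó', and to try a pattern that spells out the word boundary by hand. The following is only a sketch: the list of Hungarian accented letters in the bracket expression is an assumption about which characters can appear in the data, and whether the first query returns true or false depends on the server's locale setup.

```sql
-- Diagnostic: is 'ó' a word character / a letter as far as the regex engine is concerned?
SELECT 'ó' ~ E'\\w'        AS is_word_char,
       'ó' ~ '[[:alpha:]]' AS is_alpha;

-- Workaround sketch: replace \m and \M with explicit "start of string or non-letter"
-- and "non-letter or end of string" tests. The accented letters are listed by hand
-- (both cases) so the pattern does not rely on locale-aware classification.
SELECT *
FROM texts
WHERE title ~* '(^|[^[:alnum:]áéíóöőúüűÁÉÍÓÖŐÚÜŰ])macskacicó($|[^[:alnum:]áéíóöőúüűÁÉÍÓÖŐÚÜŰ])';
```

With the sample table, the second pattern is intended to accept 'A macskacicó' (space before, end of string after) and to reject 'macskacicónak', because the 'n' that follows is alphanumeric.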

Thanks a lot!

I use:
PostgreSQL 8.4.1 on Gentoo.
postgresql.conf:
max_connections = 100
shared_buffers = 1000 # min 16, at least max_connections*2, 8KB each
lc_messages = 'en_US.UTF-8' # locale for system error message strings
lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
lc_numeric = 'en_US.UTF-8' # locale for number formatting
lc_time = 'en_US.UTF-8' # locale for time formatting

'locale' gives:
LANG=hu_HU.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
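
The `locale` output above describes the shell environment, which is not necessarily what the database itself was created with. As far as I know, regex character classes follow the database's own LC_CTYPE and encoding, so it is probably worth checking those from psql as well. A minimal sketch using standard settings and the pg_database catalog (the datcollate and datctype columns are available from 8.4 on):

```sql
-- Encoding and locale as seen by the current session and database
SHOW server_encoding;
SHOW client_encoding;
SHOW lc_ctype;
SHOW lc_collate;

-- The same information for every database in the cluster
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database;
```

If lc_ctype turns out to be C or the encoding is SQL_ASCII, that would be one plausible explanation for 'ó' not counting as a word character.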

Responses

Re: Regexp match with accented character problem at 2010-06-08 09:45:48 from Thom Brown
Re: Regexp match with accented character problem at 2010-06-08 13:53:02 from Tom Lane
