Regexp match with accented character problem

From: Laslo Forro <getforum(at)gmail(dot)com>
To: pgsql-novice(at)postgresql(dot)org
Subject: Regexp match with accented character problem
Date: 2010-06-08 08:48:53
Message-ID: AANLkTinhs32woCPg8neaTb3jEqde7BRHx0P0_rxgo_0_@mail.gmail.com
Lists: pgsql-novice
Hi there, could someone give me a hint as to why the following happens?

The table:

test=# select * from texts;
     title      |           a_text
----------------+----------------------------
 A macskacicó   | A blah blah macskacicónak.
 The dark tower | Blah blah
(2 rows)

Now, I want to match 'macskacicó' as a whole word.

It works:
test=# select * from texts where title ~* E'macskacicó';
    title     |           a_text
--------------+----------------------------
 A macskacicó | A blah blah macskacicónak.
(1 row)

But it also matches the string 'macskacicónak':

test=# select * from texts where a_text ~* E'macskacicó';
    title     |           a_text
--------------+----------------------------
 A macskacicó | A blah blah macskacicónak.
(1 row)

Now, these do not work:

test=# select * from texts where title ~* E'\\mmacskacicó\\M';
test=# select * from texts where title ~* E'\\<macskacicó\\>';
test=# select * from texts where title ~* E'\\Wmacskacicó\\W';

(They fail with both a single and a doubled backslash.)
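
One way to narrow this down is to test the same constructs against string literals, so the table contents play no part. The following is only a sketch: the column aliases are made up for illustration, and it has not been run against this 8.4.1 installation.

```sql
-- Sketch: exercise the word-boundary escapes on literals only.
-- The column aliases are hypothetical names chosen for readability.
SELECT 'A macskacicó'   ~* E'\\mmacskacicó\\M' AS accented_word_bounded,
       'The dark tower' ~* E'\\mtower\\M'      AS ascii_word_bounded,
       -- Does the regex engine classify the accented letter as a word character?
       'ó' ~ E'^\\w$'                          AS o_acute_is_word_char,
       'o' ~ E'^\\w$'                          AS plain_o_is_word_char;
```

If the last two columns differ, the problem lies in how the accented character is classified rather than in the escaping of the backslashes.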

Now, it seems that all is ok if the string does not end with an accented
character:
test=# select * from texts where title ~* E'\\mtower\\M';
     title      |  a_text
----------------+-----------
 The dark tower | Blah blah
(1 row)

It seems that accented characters are not recognized as \w (word
characters). For instance, this matches:
select * from texts where title ~* E'\\Wmacskacic\\W';
Does this mean that I have to convert each accented character to hex form
and feed it in that way? Or is there a more elegant way to redefine the \w
class?
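
A possible workaround, sketched here under assumptions, is to spell out the word boundary with a bracket expression instead of relying on \m, \M or \w. The list of accented letters below is an assumption (the Hungarian accented vowels) and would need extending for other data; the query has not been verified on 8.4.1, and case-insensitive matching of non-ASCII letters with ~* may itself depend on the database locale.

```sql
-- Sketch: emulate a whole-word match without \m, \M or \w.
-- A "boundary" here is the start/end of the string or any character that is
-- neither [:alnum:] nor one of the explicitly listed accented letters.
-- The accented-letter list is an assumption; extend it as the data requires.
SELECT *
FROM texts
WHERE a_text ~* '(^|[^[:alnum:]áéíóöőúüűÁÉÍÓÖŐÚÜŰ])macskacicó($|[^[:alnum:]áéíóöőúüűÁÉÍÓÖŐÚÜŰ])';
```

The intent is that 'macskacicónak' no longer qualifies, because the 'n' following the word is alphanumeric, while 'macskacicó' followed by a space, punctuation, or the end of the string still does.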

Thanks a lot!

I use:
PostgreSQL 8.4.1 on Gentoo.
postgresql.conf:
max_connections = 100
shared_buffers = 1000 # min 16, at least max_connections*2, 8KB each
lc_messages = 'en_US.UTF-8' # locale for system error message strings
lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
lc_numeric = 'en_US.UTF-8' # locale for number formatting
lc_time = 'en_US.UTF-8' # locale for time formatting

'locale' gives:
LANG=hu_HU.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
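
Neither the shell's locale output nor the lc_* entries listed from postgresql.conf show the character classification (LC_CTYPE) and encoding that the database itself uses, since those are fixed when the cluster or database is created. As a sketch, they can be read from inside psql; the pg_database columns used below assume an 8.4 or later server.

```sql
-- Sketch: inspect the settings the server actually uses for this database.
SHOW server_encoding;
SHOW lc_ctype;
SHOW lc_collate;

-- Per-database view; datcollate and datctype are assumed to exist (8.4+).
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database;
```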
