From: | Laslo Forro <getforum(at)gmail(dot)com> |
---|---|
To: | pgsql-novice(at)postgresql(dot)org |
Subject: | Regexp match with accented character problem |
Date: | 2010-06-08 08:48:53 |
Message-ID: | AANLkTinhs32woCPg8neaTb3jEqde7BRHx0P0_rxgo_0_@mail.gmail.com |
Lists: | pgsql-novice |
Hi there, could someone give me a hint as to why the following happens?
The table:
test=# select * from texts;
title | a_text
--------------+-------------------------
A macskacicó | A blah blah macskacicónak.
The dark tower | Blah blah
(2 rows)
Now, I want to match 'macskacicó' as a whole word.
It works:
test=# select * from texts where title ~* E'macskacicó';
title | a_text
--------------+-------------------------
A macskacicó | A blah blah macskacicónak.
(1 row)
But the same pattern also matches the string 'macskacicónak':
test=# select * from texts where a_text ~* E'macskacicó';
title | a_text
--------------+----------------------------
A macskacicó | A blah blah macskacicónak.
(1 row)
However, none of these return any rows:
test=# select * from texts where title ~* E'\\mmacskacicó\\M';
test=# select * from texts where title ~* E'\\<macskacicó\\>';
test=# select * from texts where title ~* E'\\Wmacskacicó\\W';
(Neither with a single backslash nor with a double one.)
Now, it seems that all is ok if the string does not end with an accented
character:
test=# select * from texts where title ~* E'\\mtower\\M';
title | a_text
----------------+-----------
The dark tower | Blah blah
(1 row)
It seems that accented characters are not recognized as \w. (This query
does match, which suggests the trailing ó is treated as a non-word
character: select * from texts where title ~* E'\\Wmacskacic\\W'; )
Does it mean that I have to convert each accented character to a hex form
and feed it that way? Or is there a more elegant way to redefine the \w
class?
Thanks a lot!
I use:
PostgreSQL 8.4.1 on Gentoo.
postgresql.conf:
max_connections = 100
shared_buffers = 1000 # min 16, at least max_connections*2, 8KB each
lc_messages = 'en_US.UTF-8' # locale for system error message strings
lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
lc_numeric = 'en_US.UTF-8' # locale for number formatting
lc_time = 'en_US.UTF-8' # locale for time formatting
'locale' gives:
LANG=hu_HU.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
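Neither listing above shows the encoding or the ctype of the 'test' database itself: the postgresql.conf excerpt has no lc_ctype entry, and the 'locale' output only describes the shell. As far as I understand, in 8.4 the collation and ctype are fixed per database when it is created. A minimal sketch of how they could be checked from psql (these are queries only, not output I have captured):

```sql
-- Encoding and character-classification locale of the current database.
SHOW server_encoding;
SHOW lc_ctype;
SHOW lc_collate;

-- The same information for every database in the cluster
-- (the datcollate and datctype columns exist from 8.4 on).
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database;
```

If lc_ctype turned out to be 'C' or an encoding mismatch showed up, that would be one possible explanation for ó not being classified as a word character, though I have not confirmed this.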