From: Dawid Kuroczko <qnex42(at)gmail(dot)com>
To: Pgsql General <pgsql-general(at)postgresql(dot)org>
Subject: Tsearch2 and Unicode?
Date: 2004-11-17 16:16:32
Message-ID: 758d5e7f04111708166ea575d8@mail.gmail.com
Lists: pgsql-general
I'm trying to use tsearch2 with a database in 'UNICODE' encoding.
It works fine for English text, but since I intend to search Polish texts I did:
INSERT INTO pg_ts_cfg VALUES ('default_polish', 'default', 'pl_PL.UTF-8');
(and I updated the other pg_ts_* tables as described in the manual).
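For reference, a sketch of those configuration steps, assuming the stock tsearch2 schema of that era (column names pg_ts_cfg(ts_name, prs_name, locale) and pg_ts_cfgmap(ts_name, tok_alias, dict_name) are from memory and may differ in your version):

    -- Register a Polish configuration using the default parser:
    INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
        VALUES ('default_polish', 'default', 'pl_PL.UTF-8');

    -- Copy the token-to-dictionary mappings from the default configuration:
    INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
        SELECT 'default_polish', tok_alias, dict_name
        FROM pg_ts_cfgmap
        WHERE ts_name = 'default';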
However, Polish-specific characters are being eaten alive, it seems.
That is, running select to_tsvector('default_polish', body) from messages;
returns a list of words, but with the national characters stripped...
I wonder, am I doing something wrong, or does tsearch2 simply not grok
Unicode, despite the locale setting? The same question applies to
ispell_dict and its handling of Unicode, but that's another story.
Assuming Unicode is unsupported, I should perhaps... oh, convert
the data to iso8859 before feeding it to to_tsvector()... an interesting
idea, but so far I have failed to actually do it. Maybe store the data as
'bytea' and add a column with encoding information (assuming I don't
want to recreate the whole database with a new encoding, and that I want
to keep Unicode for some columns, so I don't have to track the encoding
of every text everywhere)...
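If conversion is the way to go, something along these lines might work; this is a sketch, assuming convert() accepts source and destination encoding names in this PostgreSQL version and that the Polish text is fully representable in ISO 8859-2 (LATIN2):

    -- Convert the UTF-8 (UNICODE) text to LATIN2 before indexing:
    SELECT to_tsvector('default_polish',
                       convert(body, 'UNICODE', 'LATIN2'))
    FROM messages;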
And while we are at it, what do you think of keeping the extra tsvector
column and its index away from my data table (so I can safely get rid
of them if need be)?
[ I intend to index around 2,000,000 records, a few KB of
text each ]...
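A sketch of that separate-table idea (table and column names here are illustrative, not from my actual schema, and assume messages has an integer primary key id):

    -- Keep the search data in its own table, joined by primary key,
    -- so it can be dropped without touching the messages table:
    CREATE TABLE messages_fts (
        message_id integer PRIMARY KEY
            REFERENCES messages (id) ON DELETE CASCADE,
        body_idx   tsvector
    );

    INSERT INTO messages_fts (message_id, body_idx)
        SELECT id, to_tsvector('default_polish', body) FROM messages;

    CREATE INDEX messages_fts_idx ON messages_fts USING gist (body_idx);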
Regards,
Dawid Kuroczko