Re: Tsearch2 and Unicode?

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Markus Wollny <Markus(dot)Wollny(at)computec(dot)de>
Cc: Dawid Kuroczko <qnex42(at)gmail(dot)com>, Pgsql General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Tsearch2 and Unicode?
Date: 2004-11-22 13:48:15
Message-ID: Pine.GSO.4.61.0411221645540.24069@ra.sai.msu.su
Lists: pgsql-general

Markus,

it'd be nice if you (or somebody) wrote a note about Unicode, so it
could be added to the tsearch2 documentation. It would help people and
save time and hair :)

Oleg
On Mon, 22 Nov 2004, Markus Wollny wrote:

> Hi!
>
> I dug through my list archives. I used to have the very same problem that you described: special characters being swallowed by the tsearch2 functions. The cause was that I had initdb'ed my cluster with DE@euro as the locale, whereas my databases used Unicode encoding. This combination does not work correctly. I had to dump, initdb with the correct UTF-8 locale (de_DE.UTF-8 in my case) and reload to get tsearch2 to work correctly. You may find the original discussion here: http://archives.postgresql.org/pgsql-general/2004-07/msg00620.php
> If you wish to find out which locale was used during initdb for your cluster, you can use the pg_controldata program that's supplied with PostgreSQL.
>
> Kind regards
>
> Markus
>
>
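The pg_controldata check mentioned above can be scripted. The following is a minimal sketch, not part of the original thread. It assumes the pg_controldata output contains `LC_COLLATE:` and `LC_CTYPE:` lines, as it did in PostgreSQL releases of that era (later releases may not print them). The function names and the sample text are hypothetical, and the UTF-8 test is only a heuristic on the locale name.

```python
def cluster_locales(controldata_output):
    """Return the LC_* settings found in pg_controldata-style output."""
    locales = {}
    for line in controldata_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().startswith("LC_"):
            locales[key.strip()] = value.strip()
    return locales


def looks_like_utf8(locale_name):
    """Heuristic: does the locale name carry a UTF-8 codeset suffix?"""
    # "de_DE.UTF-8@euro" -> codeset "UTF-8"; "de_DE@euro" -> codeset ""
    codeset = locale_name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


# Illustrative sample only, not real output. A real check would pass in
# the text printed by running pg_controldata on the data directory.
SAMPLE = """\
LC_COLLATE:                           de_DE@euro
LC_CTYPE:                             de_DE@euro
"""

if __name__ == "__main__":
    for name, value in cluster_locales(SAMPLE).items():
        status = "UTF-8" if looks_like_utf8(value) else "not UTF-8"
        print(f"{name} = {value} ({status})")
```

If the cluster locale is reported as not UTF-8 while the database encoding is Unicode, that is the mismatch described above, and the dump, initdb, reload route applies.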
>
>> -----Original Message-----
>> From: pgsql-general-owner(at)postgresql(dot)org
>> [mailto:pgsql-general-owner(at)postgresql(dot)org] On Behalf Of
>> Dawid Kuroczko
>> Sent: Wednesday, 17 November 2004 17:17
>> To: Pgsql General
>> Subject: [GENERAL] Tsearch2 and Unicode?
>>
>> I'm trying to use tsearch2 with a database which is in
>> 'UNICODE' encoding.
>> It works fine for English text, but as I intend to search
>> Polish texts I did:
>>
>> insert into pg_ts_cfg values ('default_polish', 'default',
>> 'pl_PL.UTF-8'); (and I updated the other pg_ts_* tables as
>> written in the manual).
>>
>> However, Polish-specific chars are being eaten alive, it seems.
>> I.e. doing select to_tsvector('default_polish', body) from
>> messages; results in a list of words but with national chars stripped...
>>
>> I wonder, am I doing something wrong, or just tsearch2
>> doesn't grok Unicode, despite the locales setting? This also
>> is a good question regarding ispell_dict and its feelings
>> regarding Unicode, but that's another story.
>>
>> Assuming Unicode is unsupported, I should perhaps... oh,
>> convert the data to iso8859 prior to feeding it to to_tsvector()...
>> interesting idea, but so far I have failed to actually do
>> it. Maybe store the data as 'bytea' and add a column with
>> encoding information (assuming I don't want to recreate the whole
>> database with a new encoding, and that I want to use Unicode
>> for some columns, so I don't have to keep the encoding with every
>> text everywhere...).
>>
>> And while we are at it, how do you feel -- an extra column
>> with tsvector and its index -- would it be OK to keep it away
>> from my data (so I can safely get rid of them if need be)?
>> [ I intend to keep an index of around 2 000 000 records, a few KB
>> of text each ]...
>>
>> Regards,
>> Dawid Kuroczko
>>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83