Re: Tsearch2 and Unicode?

From: "Markus Wollny" <Markus(dot)Wollny(at)computec(dot)de>
To: "Dawid Kuroczko" <qnex42(at)gmail(dot)com>, "Pgsql General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Tsearch2 and Unicode?
Date: 2004-11-22 13:22:50
Message-ID: 2266D0630E43BB4290742247C8910575068B75A3@dozer.computec.de
Lists: pgsql-general

Hi!

I dug through my list archives - I actually used to have the very same problem you describe: special chars being swallowed by the tsearch2 functions. The cause was that I had initdb'ed my cluster with DE(at)euro as the locale, whereas my databases used Unicode encoding. That combination does not work correctly. I had to dump, initdb with the matching UTF-8 locale (de_DE.UTF-8 in my case) and reload to get tsearch2 working correctly. You can find the original discussion here: http://archives.postgresql.org/pgsql-general/2004-07/msg00620.php
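For reference, the dump/initdb/reload cycle looks roughly like this (a sketch only; the data directory path, dump file name and locale are placeholders you would adapt to your own installation):

```shell
# Dump all databases from the old cluster (run as the postgres user)
pg_dumpall > /tmp/all.dump

# Stop the server and move the old data directory aside
pg_ctl stop -D /var/lib/pgsql/data
mv /var/lib/pgsql/data /var/lib/pgsql/data.old

# Re-run initdb with a UTF-8 locale matching the database encoding
initdb --locale=de_DE.UTF-8 -D /var/lib/pgsql/data

# Start the server and restore the dump
pg_ctl start -D /var/lib/pgsql/data
psql -f /tmp/all.dump template1
```
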
If you wish to find out which locale was used during INITDB for your cluster, you may use the pg_controldata program that's supplied with PostgreSQL.
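Checking the locale with pg_controldata looks like this (a sketch; the data directory path is an assumption - point it at your actual cluster):

```shell
# pg_controldata prints the cluster's control data;
# the LC_COLLATE and LC_CTYPE lines show the locale used at initdb time
pg_controldata /var/lib/pgsql/data | grep LC_
```
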

Kind regards

Markus

> -----Original Message-----
> From: pgsql-general-owner(at)postgresql(dot)org
> [mailto:pgsql-general-owner(at)postgresql(dot)org] On Behalf Of
> Dawid Kuroczko
> Sent: Wednesday, 17 November 2004 17:17
> To: Pgsql General
> Subject: [GENERAL] Tsearch2 and Unicode?
>
> I'm trying to use tsearch2 with database which is in
> 'UNICODE' encoding.
> It works fine for English text, but as I intend to search
> Polish texts I did:
>
> insert into pg_ts_cfg values ('default_polish', 'default',
> 'pl_PL.UTF-8'); (and I updated the other pg_ts_* tables as
> described in the manual).
>
> However, Polish-specific chars are being eaten alive, it seems.
> I.e. doing select to_tsvector('default_polish', body) from
> messages; results in list of words but with national chars stripped...
>
> I wonder, am I doing something wrong, or just tsearch2
> doesn't grok Unicode, despite the locales setting? This also
> is a good question regarding ispell_dict and its feelings
> regarding Unicode, but that's another story.
>
> Assuming Unicode unsupported means I should perhaps... oh,
> convert the data to iso8859 prior feeding it to_tsvector()...
> interesting idea, but so far I have failed to actually do
> it. Maybe store the data as 'bytea' and add a column with
> encoding information (assuming I don't want to recreate the whole
> database with a new encoding, and that I want to use unicode
> for some columns, so I don't have to keep encoding info with
> every text everywhere...).
>
> And while we are at it, how do you feel -- an extra column
> with tsvector and its index -- would it be OK to keep it away
> from my data (so I can safely get rid of them if need be)?
> [ I intend to keep index of around 2 000 000 records, few KBs
> of text each ]...
>
> Regards,
> Dawid Kuroczko
>
>
