Re: Extending range of to_tsvector et al

From: Dan Scott <denials(at)gmail(dot)com>
To: john knightley <john(dot)knightley(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Extending range of to_tsvector et al
Date: 2012-10-01 03:58:11
Message-ID: CAAY5AM3d4SYKYVOO82b8urtGKkGOnRjpUbmyPfU37gC_baSY8w@mail.gmail.com
Lists: pgsql-hackers

Hi John:

On Sun, Sep 30, 2012 at 11:45 PM, john knightley
<john(dot)knightley(at)gmail(dot)com> wrote:
> Dear Dan,
>
> thank you for your reply.
>
> The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed on
> a UTF-8 locale.
>
> A short 5 line dictionary file is sufficient to test:-
>
> raeuz
> 我们
> 𦘭𥎵
> 𪽖𫖂
> 󶒘󴮬
>
> line 1 "raeuz": a Zhuang word written using English letters; shows up
> under ts_vector OK
> line 2 "我们": an everyday Chinese word; shows up under ts_vector OK
> line 3 "𦘭𥎵": a Zhuang word written using rather old Chinese characters
> found in Unicode 3.1, which came in about the year 2000; shows up
> under ts_vector OK
> line 4 "𪽖𫖂": a Zhuang word written using rather old Chinese characters
> found in Unicode 5.2, which came in about the year 2009; does not show
> up under ts_vector
> line 5 "󶒘󴮬": a Zhuang word written using rather old Chinese characters
> found in the PUA area of the font Sawndip.ttf; does not show up under
> ts_vector (the font can be downloaded from
> http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)
>
> The last two words even though included in a dictionary do not get
> accepted by ts_vector.

Hmm. On Fedora 17 x86-64 with PostgreSQL 9.1.5 here, the last of those
words (line 5) seems to work using the default text search
configuration, albeit with one crucial note: I created the database
with the "lc_ctype=C lc_collate=C" options:

WORKING:

$ createdb --template=template0 --lc-ctype=C --lc-collate=C foobar
foobar=# select ts_debug('󶒘󴮬');
ts_debug
----------------------------------------------------------------
(word,"Word, all letters",󶒘󴮬,{english_stem},english_stem,{󶒘󴮬})
(1 row)

NOT WORKING AS EXPECTED:

foobaz=# SHOW LC_CTYPE;
lc_ctype
-------------
en_US.UTF-8
(1 row)

foobaz=# select ts_debug('󶒘󴮬');
ts_debug
---------------------------------
(blank,"Space symbols",󶒘󴮬,{},,)
(1 row)

So... perhaps LC_CTYPE=C is a possible workaround for you?
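
If it helps narrow things down, here is a rough Python sketch (not PostgreSQL code) that prints the Unicode general category of each character in the five test words. It assumes, and this is only my guess, that the difference between the two databases comes from how characters get classified as letters under each LC_CTYPE; the helper names `categories` and `all_letters` are made up for illustration, and Python's own Unicode tables may well differ from what the C library on your system knows about.

```python
import unicodedata

# The five test words from the dictionary file quoted above.
WORDS = ["raeuz", "我们", "𦘭𥎵", "𪽖𫖂", "󶒘󴮬"]


def categories(word):
    """Return the Unicode general category of each character in word."""
    return [unicodedata.category(ch) for ch in word]


def all_letters(word):
    """True if every character has a letter category (Lu, Ll, Lo, ...)."""
    return all(cat.startswith("L") for cat in categories(word))


if __name__ == "__main__":
    for word in WORDS:
        # ascii() shows the code points as escapes, so the output stays
        # readable even without the Sawndip font installed.
        print(ascii(word), categories(word), all_letters(word))
```

Private-use characters are reported as category "Co" rather than as letters, so a word made only of PUA code points will fail the `all_letters` check here regardless of font. Whether that is what the text search parser trips over under en_US.UTF-8 is an assumption on my part, not something I have confirmed in the source.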
