Re: Extending range of to_tsvector et al

From: john knightley <john(dot)knightley(at)gmail(dot)com>
To: Dan Scott <denials(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Extending range of to_tsvector et al
Date: 2012-10-01 04:52:34
Message-ID: CA+nPCM-YXLLSszLW9Q_urCjzwnfkvFJNWYxcsfvvsB86fVJa-A@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 1, 2012 at 11:58 AM, Dan Scott <denials(at)gmail(dot)com> wrote:
> Hi John:
>
> On Sun, Sep 30, 2012 at 11:45 PM, john knightley
> <john(dot)knightley(at)gmail(dot)com> wrote:
>> Dear Dan,
>>
>> thank you for your reply.
>>
>> The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed on
>> a UTF-8 locale.
>>
>> A short five-line dictionary file is sufficient to test:
>>
>> raeuz
>> 我们
>> 𦘭𥎵
>> 𪽖𫖂
>> 󶒘󴮬
>>
>> line 1 "raeuz": a Zhuang word written using English letters; it shows
>> up under to_tsvector OK
>> line 2 "我们": an everyday Chinese word; it shows up under to_tsvector OK
>> line 3 "𦘭𥎵": a Zhuang word written using rather old Chinese characters
>> found in Unicode 3.1, which came in about the year 2000; it shows up
>> under to_tsvector OK
>> line 4 "𪽖𫖂": a Zhuang word written using rather old Chinese characters
>> found in Unicode 5.2, which came in about the year 2009; it does not
>> show up under to_tsvector
>> line 5 "󶒘󴮬": a Zhuang word written using rather old Chinese characters
>> found in the PUA area of the font Sawndip.ttf; it does not show up
>> under to_tsvector (the font can be downloaded from
>> http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)
>>
>> The last two words, even though they are included in a dictionary, are
>> not accepted by to_tsvector.
>
> Hmm. Fedora 17 x86-64 w/ PostgreSQL 9.1.5 here, the latter seems to
> work using the default text search configuration (albeit with one
> crucial note: I created the database with the "lc_ctype=C
> lc_collate=C" options):
>
> WORKING:
>
> createdb --template=template0 --lc-ctype=C --lc-collate=C foobar
> foobar=# select ts_debug('󶒘󴮬');
> ts_debug
> ----------------------------------------------------------------
> (word,"Word, all letters",󶒘󴮬,{english_stem},english_stem,{󶒘󴮬})
> (1 row)
>
> NOT WORKING AS EXPECTED:
>
> foobaz=# SHOW LC_CTYPE;
> lc_ctype
> -------------
> en_US.UTF-8
> (1 row)
>
> foobaz=# select ts_debug('󶒘󴮬');
> ts_debug
> ---------------------------------
> (blank,"Space symbols",󶒘󴮬,{},,)
> (1 row)
>
> So... perhaps LC_CTYPE=C is a possible workaround for you?

Setting LC_CTYPE=C would not be a workaround: this database needs to be
in UTF-8, as the full text search is to be used for MediaWiki. Is this a
bug that is being worked on?
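
For anyone trying to reproduce this, a minimal sketch in Python (not part
of my original test; the helper name describe is made up for illustration)
lists the code point, Unicode plane and general category of each character
in the five dictionary words. It only reports what Python's own Unicode
tables say, which may well differ from the locale data PostgreSQL consults
on a given system, so treat it as a way to see which words fall outside
the BMP or in a private use area rather than as a diagnosis.

```python
import unicodedata

# The five test words from the dictionary file quoted above.
WORDS = ["raeuz", "我们", "𦘭𥎵", "𪽖𫖂", "󶒘󴮬"]


def describe(word):
    """Return a (code point, plane, general category) tuple per character."""
    rows = []
    for ch in word:
        cp = ord(ch)
        # The plane is the code point divided by 0x10000; plane 0 is the BMP.
        rows.append((f"U+{cp:04X}", cp >> 16, unicodedata.category(ch)))
    return rows


if __name__ == "__main__":
    for word in WORDS:
        for code, plane, category in describe(word):
            print(word, code, plane, category)
```

A category of "Co" means private use and "Cn" means unassigned in the
Unicode version that the Python build ships with; a newer or older build
may report line 4 differently.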

Regards
John
