Re: Extending range of to_tsvector et al

From:	john knightley <john(dot)knightley(at)gmail(dot)com>
To:	Dan Scott <denials(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Extending range of to_tsvector et al
Date:	2012-10-01 03:45:05
Message-ID:	CA+nPCM9mTszOyEda7SPwothev_0=45sgeTGOYOH3QVrf8RwAVQ@mail.gmail.com
Lists:	pgsql-hackers

Dear Dan,

Thank you for your reply.

The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed on
a UTF-8 locale.

A short five-line dictionary file is sufficient to test:

raeuz
我们
𦘭𥎵
𪽖𫖂
󶒘󴮬

Line 1, "raeuz": a Zhuang word written in Latin letters; it shows up
under to_tsvector correctly.
Line 2, "我们": an everyday Chinese word; it shows up under
to_tsvector correctly.
Line 3, "𦘭𥎵": a Zhuang word written with rather old Chinese
characters added in Unicode 3.1 (around the year 2000); it shows up
under to_tsvector correctly.
Line 4, "𪽖𫖂": a Zhuang word written with rather old Chinese
characters added in Unicode 5.2 (around the year 2009); it does not
show up under to_tsvector.
Line 5, "󶒘󴮬": a Zhuang word written with rather old Chinese
characters found in the PUA area of the font Sawndip.ttf; it does not
show up under to_tsvector. (The font can be downloaded from
http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)

The last two words, even though they are included in the dictionary,
are not accepted by to_tsvector.
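
To make the five test words easier to compare, here is a minimal
sketch that lists the code point of each character together with the
Unicode range it falls in. It does not reproduce PostgreSQL's parser
and does not explain why the last two words are rejected; it only
shows where each character lives. The helper names (classify,
describe) and the range labels are my own, chosen for illustration;
the ranges themselves are taken from the Unicode block list.

```python
# Illustrative only: report which Unicode range each character belongs to.
# Ranges are checked in order, so the BMP private-use area is listed
# before the catch-all BMP entry.
RANGES = [
    (0xE000, 0xF8FF, "BMP Private Use Area"),
    (0x0000, 0xFFFF, "BMP"),
    (0x20000, 0x2A6DF, "CJK Extension B (Unicode 3.1)"),
    (0x2A700, 0x2B73F, "CJK Extension C (Unicode 5.2)"),
    (0xF0000, 0xFFFFD, "Supplementary Private Use Area-A"),
    (0x100000, 0x10FFFD, "Supplementary Private Use Area-B"),
]


def classify(ch):
    """Return the label of the first range containing ch, else 'other'."""
    cp = ord(ch)
    for lo, hi, label in RANGES:
        if lo <= cp <= hi:
            return label
    return "other"


def describe(word):
    """Return a (code point, range label) pair for every character."""
    return [(f"U+{ord(ch):04X}", classify(ch)) for ch in word]


if __name__ == "__main__":
    # The five dictionary entries from this message.
    for word in ["raeuz", "我们", "𦘭𥎵", "𪽖𫖂", "󶒘󴮬"]:
        print(word, describe(word))
```

Running it over the dictionary file should show whether the rejected
words are exactly the ones outside the BMP and Extension B ranges,
which may help when building test cases.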

Regards
John

On Mon, Oct 1, 2012 at 11:04 AM, Dan Scott <denials(at)gmail(dot)com> wrote:
> On Sun, Sep 30, 2012 at 1:56 PM, johnkn63 <john(dot)knightley(at)gmail(dot)com> wrote:
>> When using to_tsvector a number of newer unicode characters and pua
>> characters are not included. How do I add the characters which I desire to
>> be found?
>
> I've just started digging into this code a bit, but from what I've
> found src/backend/tsearch/wparser_def.c defines much of the parser
> functionality, and in the area of Unicode includes a number of
> comments like:
>
> * with multibyte encoding and C-locale isw* function may fail or give
> wrong result.
> * multibyte encoding and C-locale often are used for Asian languages.
> * any non-ascii symbol with multibyte encoding with C-locale is an
> alpha character
>
> ... in concert with ifdefs around WIDE_UPPER_LOWER (in effect if
> WCSTOMBS and TOWLOWER are available) to complicate testing scenarios
> :)
>
> Also note that src/test/regress/sql/tsearch.sql and
> regress/sql/tsdicts.sql currently focus on English, ASCII-only data.
>
> Perhaps this is a good opportunity for you to describe what your
> environment looks like (OS, PostgreSQL version, encoding and locale
> settings for the database) and show some sample to_tsquery() @@
> to_tsvector() queries that don't behave the way you think they should
> behave - and we could start building some test cases as a first step?
>
> --
> Dan Scott
> Laurentian University

In response to

Re: Extending range of to_tsvector et al at 2012-10-01 03:04:24 from Dan Scott

Responses

Re: Extending range of to_tsvector et al at 2012-10-01 03:58:11 from Dan Scott
Re: Extending range of to_tsvector et al at 2012-10-01 04:11:18 from Tom Lane
