Re: Extending range of to_tsvector et al

From: john knightley <john(dot)knightley(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dan Scott <denials(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Extending range of to_tsvector et al
Date: 2012-10-01 04:35:04
Message-ID: CA+nPCM_rDKbS7H9XODwBKkdK3MaPt=qJiRXmiihF2giWC8zzhA@mail.gmail.com
Lists: pgsql-hackers
On Mon, Oct 1, 2012 at 12:11 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> john knightley <john(dot)knightley(at)gmail(dot)com> writes:
>> The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed on
>> a utf8 local
>
>> A short 5 line dictionary file  is sufficient to test:-
>
>> raeuz
>> 我们
>> 𦘭𥎵
>> 𪽖𫖂
>> 󶒘󴮬
>
>> line 1 "raeuz" Zhuang word written using English letters; shows up
>> under ts_vector ok
>> line 2 "我们" everyday Chinese word; shows up under ts_vector ok
>> line 3 "𦘭𥎵" Zhuang word written using rather old Chinese characters
>> found in Unicode 3.1, which came in about the year 2000; shows up
>> under ts_vector ok
>> line 4 "𪽖𫖂" Zhuang word written using rather old Chinese characters
>> found in Unicode 5.2, which came in about the year 2009; does not show
>> up under ts_vector
>> line 5 "󶒘󴮬" Zhuang word written using rather old Chinese characters
>> found in the PUA area of the font Sawndip.ttf; does not show up under
>> ts_vector (Font can be downloaded from
>> http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)
>
> AFAIK there is nothing in Postgres itself that would distinguish, say,
> 𦘭 from 𪽖.  I think this must be down to
> your platform's locale definition: it probably thinks that the former is
> a letter and the latter is not.  You'd have to gripe to the locale
> maintainers to get that fixed.
>
>                         regards, tom lane

PostgreSQL in general does not distinguish between them, but full text search does:

 select ts_debug('𦘭 from 𪽖');

gives the result:-

                             ts_debug
-------------------------------------------------------------------
 (word,"Word, all letters",𦘭,{english_stem},english_stem,{𦘭})
 (blank,"Space symbols"," ",{},,)
 (asciiword,"Word, all ASCII",from,{english_stem},english_stem,{})
 (blank,"Space symbols"," 𪽖",{},,)
(4 rows)

Somewhere there is a dictionary or library that appears to be based on
about Unicode 4.0, which includes "𦘭" (U+2662D) but not "𫖂" (U+2B582),
which was added in Unicode 5.2.

PUA characters are also dropped in the same way by full text search.
That is what Google does too, but it is not what I want to do.
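
As a rough cross-check of the Unicode side of this, here is a minimal
Python sketch that prints the code point and general category of one
character from each group. It uses Python's own bundled Unicode tables,
not the platform locale that Tom suspects, so it only shows what a
recent Unicode database says about these characters. The two CJK code
points are the ones quoted above; U+F0000 is only an assumed stand-in
for a PUA character, not an actual Sawndip.ttf code point.

```python
import unicodedata

# Code points as quoted in this message. The PUA entry is a hypothetical
# stand-in from Supplementary Private Use Area-A, not a Sawndip.ttf glyph.
samples = {
    "Unicode 3.1 (CJK Ext. B)": "\U0002662D",
    "Unicode 5.2 (CJK Ext. C)": "\U0002B582",
    "PUA (hypothetical)": "\U000F0000",
}


def describe(ch):
    """Return (code point, general category, str.isalpha()) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isalpha())


# The version of the Unicode database bundled with this Python build.
print("Unicode database version:", unicodedata.unidata_version)
for label, ch in samples.items():
    print(label, *describe(ch))
```

With a Unicode database at 5.2 or later, both CJK characters should be
reported as letters (category Lo), while the PUA character is category
Co and is not treated as alphabetic. A library built on older tables
would not know the Extension C character at all, which would match the
ts_debug output above.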

Regards
John