Re: pg_trgm

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: ishii(at)sraoss(dot)co(dot)jp, tgl(at)sss(dot)pgh(dot)pa(dot)us, andres(at)anarazel(dot)de, pgsql-hackers(at)postgresql(dot)org, teodor(at)sigaev(dot)ru
Subject: Re: pg_trgm
Date: 2010-05-27 18:01:01
Message-ID: 1274983261.18581.14.camel@vanquo.pezone.net
Lists: pgsql-hackers

On Fri, 2010-05-28 at 00:46 +0900, Tatsuo Ishii wrote:
> > I don't know about Japanese, but the locale approach works just fine for
> > other agglutinative languages. I would rather suspect that it is the
> > trigram approach that might be rather useless for such languages,
> > because you are going to get a lot of similarity hits for the affixes.
>
> I'm not sure what you mean by "affixes". But I will explain...
>
> A Japanese sentence consists of words. The problem is that the words
> are not separated by spaces (agglutinative). So most text tools, such
> as text search, need a preprocessing step that finds word boundaries by
> looking up dictionaries (and a smart grammar analysis routine). In the
> process, "affixes" can be determined and perhaps removed from the target
> word group to be used for text search (note that removing affixes is not
> relevant to locale). Once we get a space-separated sentence, it can be
> processed by text search or by pg_trgm just the same as English. (Note
> that this preprocessing is done outside the PostgreSQL world.) The
> only difference is that a "word" can consist of non-ASCII letters.

I think the problem at hand has nothing at all to do with agglutination
or CJK-specific issues. You will get the same problem with other
languages *if* you set a locale that does not adequately support the
characters in use. E.g., Russian with locale C and encoding UTF8:

select similarity(E'\u0441\u043B\u043E\u043D', E'\u0441\u043B\u043E\u043D\u044B');
similarity
────────────
NaN
(1 row)
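
A rough Python model of how a NaN like this can arise. It is a sketch, not pg_trgm's actual C code. It assumes that under locale C only ASCII letters and digits count as word characters, that each word is padded with two leading blanks and one trailing blank before trigrams are taken, and that similarity is the number of shared trigrams divided by the number of distinct trigrams in both strings. The function names and the ascii_only flag are made up for illustration.

```python
import math


def is_word_char(ch, ascii_only):
    # Assumption: locale C recognizes only ASCII letters and digits.
    if ascii_only:
        return ch.isascii() and ch.isalnum()
    return ch.isalnum()


def trigrams(text, ascii_only):
    # Split the text into words, treating every non-word character as a separator.
    words, current = [], []
    for ch in text.lower():
        if is_word_char(ch, ascii_only):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))

    # Pad each word and collect its distinct three-character windows.
    result = set()
    for word in words:
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b, ascii_only):
    ta, tb = trigrams(a, ascii_only), trigrams(b, ascii_only)
    shared = len(ta & tb)
    total = len(ta | tb)
    if total == 0:
        # Python raises on 0/0; C floating point would yield NaN here.
        return math.nan
    return shared / total


a = "\u0441\u043b\u043e\u043d"
b = "\u0441\u043b\u043e\u043d\u044b"
print(similarity(a, b, ascii_only=True))   # nan: no trigrams were extracted
print(similarity(a, b, ascii_only=False))  # 4/7, about 0.571
```

In this model, neither string yields any trigrams once the Cyrillic letters stop counting as word characters, so the ratio degenerates to 0/0. Nothing in it depends on agglutination or on CJK text.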
