Re: [Fwd: Re: tsearch in core patch]

From: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
To: josh(at)agliodbs(dot)com
Cc: ishii(at)sraoss(dot)co(dot)jp, tgl(at)sss(dot)pgh(dot)pa(dot)us, euler(at)timbira(dot)com, teodor(at)sigaev(dot)ru, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [Fwd: Re: tsearch in core patch]
Date: 2007-06-30 23:13:26
Message-ID: 20070701.081326.41659812.t-ishii@sraoss.co.jp
Lists: pgsql-hackers

> Ishii-san,
>
> >>> Ok, probably we need to copy the English stemming rule to the one for
> >>> Japanese.
> >> Pardon my ignorance here, but is the concept of stemming even relevant
> >> to Japanese/Chinese/Korean? What little I know about ideographic
> >> languages suggests it wouldn't work well. And surely the specific rules
> >> in the Snowball project's English stemmer wouldn't work.
> >
> > Your understanding is correct. The English stemmer would not work for
> > the "non-English" part of Japanese text.
>
> That reminds me, don't you guys have your own full text search for
> Japanese? Planning on merging it with the core code anytime soon?

No. Actually, the non-English part of Japanese text does not need
stemming at all. However, since Japanese is an agglutinative language,
we have to break a continuous Japanese string into space-separated
"words". For example, we need to break:

todayisfine

into:

today is fine

(Of course, the English words are only there for the benefit of
non-Japanese speakers; the actual text would be Japanese.)

For this we need a good dictionary and good software. Fortunately,
there are several open source packages for this purpose. I once wrote
a PostgreSQL C function that invokes one of these packages to do the
work, and it works great with tsearch2.
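
To make the word-breaking step concrete, here is a minimal sketch of
dictionary-based segmentation using greedy longest match. It is only an
illustration: the function name, the toy dictionary, and the
single-character fallback are assumptions made for this example, not
the method used by the open source packages mentioned above, which rely
on far larger dictionaries and proper morphological analysis.

```python
def segment(text, dictionary):
    """Split unspaced text into words by greedy longest match.

    `dictionary` is a set of known words (a toy stand-in for a real
    morphological dictionary). Characters that match no dictionary
    entry are emitted as single-character tokens.
    """
    # Longest dictionary entry bounds how far ahead we need to look.
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                words.append(candidate)
                i += length
                break
        else:
            # No dictionary entry starts here: fall back to one character.
            words.append(text[i])
            i += 1
    return words


# Hypothetical toy dictionary matching the example in this message.
TOY_DICTIONARY = {"today", "is", "fine"}

print(" ".join(segment("todayisfine", TOY_DICTIONARY)))  # today is fine
```

The space-separated result is the kind of string that can then be
handed to tsearch2 as ordinary words. Greedy matching can choose the
wrong split when dictionary entries overlap, which is one reason a good
dictionary and dedicated software are needed in practice.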
--
Tatsuo Ishii
SRA OSS, Inc. Japan
