Re: contrib/tsearch

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc: Hackers <pgsql-hackers(at)postgresql(dot)org>, <martin_porter(at)softhome(dot)net>, Teodor Sigaev <teodor(at)stack(dot)net>
Subject: Re: contrib/tsearch
Date: 2002-09-06 10:52:11
Message-ID: Pine.GSO.4.44.0209061348260.13637-100000@ra.sai.msu.su
Lists: pgsql-hackers

On Fri, 6 Sep 2002, Christopher Kings-Lynne wrote:

> > Should we check for stop words before stemming or after?
>
> I think you should.
>
> > In the first case we have to collect all forms of the stop words,
> > which is doable but difficult to maintain; in the latter we'll
> > still have the current problem.
>
> Looking at the list of stopwords you sent me, Oleg, there are only about 1
> out of the list of 120 stopwords that need to have all word forms added. I
> also don't think it'll be a maintenance problem. The reason I think this is
> because stopwords in general don't have different word forms.
>
> eg. her, his, i, and, etc. They don't have different forms. In fact, the
> _only_ word in the stopword list that needs a different form is yourself and
> yourselves. Actually, according to dictionary.com 'ourself' is also a word.
> 'themself' isn't tho. Some others I don't know about are:
>
> 'veri' - I assume this is stemmed 'very', so why not just use 'very'?

That's because we currently check for stop words after stemming, and
I think Porter's algorithm converts 'very' to 'veri' :-)
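
To make the ordering issue concrete, here is a minimal sketch in Python. It is
not tsearch code: toy_stem is a hypothetical stand-in that applies only the
y -> i rule of Porter's step 1c, and the word lists and function names are made
up for illustration.

```python
VOWELS = set("aeiou")


def toy_stem(word):
    """Toy stand-in for a stemmer (assumption, not the real Porter stemmer).

    Applies only the step 1c rule: a final 'y' becomes 'i' when the
    rest of the word contains a vowel, so 'very' -> 'veri'.
    """
    if word.endswith("y") and any(ch in VOWELS for ch in word[:-1]):
        return word[:-1] + "i"
    return word


def filter_before_stemming(words, stopwords):
    # Compare the raw word against the list, then stem what survives.
    return [toy_stem(w) for w in words if w not in stopwords]


def filter_after_stemming(words, stopwords):
    # Stem first, then compare the stem against the list.
    return [s for s in (toy_stem(w) for w in words) if s not in stopwords]


words = ["very", "happy", "query"]

# Checking before stemming works with the natural spelling 'very'.
kept_before = filter_before_stemming(words, {"very"})

# Checking after stemming only works if the list holds the stem 'veri'.
kept_after_stemmed_list = filter_after_stemming(words, {"veri"})

# With an unstemmed list, checking after stemming lets 'veri' leak through.
kept_after_plain_list = filter_after_stemming(words, {"very"})
```

With the check done after stemming, the list has to contain stems such as
'veri'; with the check done before stemming, it has to contain every surface
form of each stop word instead.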

>
> So, why don't you change tsearch to check for stop words _before_ stemming?
> I can give you a list of revised stopwords that haven't been stemmed, with
> all forms of the words.
>

I agree that the English list is probably easy to maintain, but what about
other languages? We don't have any volunteers - you're the first one.

> > It's time for beta1 and I'm not sure we can work on this issue
> > right now, but I feel a lot of pressure from tsearch users :-)
> > If people want to help us, why not work on a stop word list that includes
> > all forms? In any case, we are not native English speakers, so don't expect
> > us to create a more or less decent list. The programming changes are trivial;
> > for the moment we'll probably end up just using a compile-time option.
> > As always, your patches are welcome!
>
> I'm happy to work on the list of stopwords for you, Oleg. I agree this
> might be 7.4 thing though...

We could always keep updates separately on our page and in CVS.

>
> Chris
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
