Re: PATCH: Update snowball stemmers

From: Arthur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Subject: Re: PATCH: Update snowball stemmers
Date: 2018-09-25 11:45:08
Message-ID: 20180925114506.GA14666@zakirov.localdomain
Lists: pgsql-hackers

On Mon, Sep 24, 2018 at 05:36:39PM -0400, Tom Lane wrote:
> I reviewed and pushed this.

Great! Thank you.

> As a cross-check on the patch, I cloned the Snowball github repo
> and built the derived files in it. I noticed that they'd incorporated
> several new stemmers since 2007 --- not only your Nepali one, but
> half a dozen more besides. Since the point here is (IMO) mostly to
> follow their lead on what's interesting, I went ahead and added those
> as well.

Agreed. It is a good decision, and it may attract more users.

> Although I added nepali.stop from the other patch, I've not done
> anything about updating our other stopword lists. Presumably those
> are a bit obsolete by now as well. I wonder if we can prevail on
> the Snowball people to make those available in some less painful way
> than scraping them off assorted web pages. Ideally they'd stick them
> into their git repo ...

They have a snowball-website repository [1], which holds the source of
the snowballstem.org web site. It also stores stopword lists for various
languages (for example, English [2]). I checked a couple of languages.
Their Russian and Danish stopword lists seem to match PostgreSQL's
stopword lists, but their English stopword list is different.

Stopword lists are missing there for the following languages:
- arabic
- irish
- lithuanian
- nepali - I can create a pull request to add it to snowball-website
- tamil

There is also another project, called Stopwords ISO [3], but I'm not
sure about it. It collects stopword lists from various sources, and it
has stopwords for the languages mentioned above, except for Nepali and
Tamil.

I think I could write a script that generates our stopword files from
the snowball-website repository. It could be run periodically. Would
that be suitable? It would also be good to get the missing stopword
lists from Stopwords ISO added to snowball-website...
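
A rough sketch of what such a script might do, only to illustrate the
idea. It assumes the stop.txt files in snowball-website use a vertical
bar to start a comment and put one stop word at the beginning of a line,
and that the output should be one word per line in UTF-8. The function
names are hypothetical, not part of any existing tool:

```python
from pathlib import Path


def extract_stopwords(text):
    """Return the stop words found in stop.txt-style content, in file order."""
    words = []
    seen = set()
    for line in text.splitlines():
        # Assumption: everything after the first vertical bar is a comment.
        body = line.split("|", 1)[0]
        tokens = body.split()
        if not tokens:
            continue  # blank line or comment-only line
        # Assumption: the first token on the line is the stop word.
        word = tokens[0]
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words


def write_stopfile(words, path):
    """Write one stop word per line, UTF-8 encoded (assumed target format)."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")


if __name__ == "__main__":
    sample = (
        " | An example list, not real data\n"
        "i              | subject\n"
        "me\n"
        "my\n"
        "\n"
        "i\n"
    )
    print(extract_stopwords(sample))  # prints ['i', 'me', 'my']
```

Fetching the files from the repository and handling their actual
encoding is left out here, since that depends on how the lists are
really stored.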

1 - https://github.com/snowballstem/snowball-website/tree/master/algorithms
2 - https://github.com/snowballstem/snowball-website/blob/master/algorithms/english/stop.txt
3 - https://github.com/stopwords-iso

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
