Re: snowball ASCII stemmer configuration

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: snowball ASCII stemmer configuration
Date: 2020-06-16 14:37:17
Message-ID: 1301915.1592318237@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> writes:
>> Moreover, AFAIK, the following other languages do not use Latin-based
>> alphabets:

>> arabic arabic \
>> greek greek \
>> nepali nepali \
>> tamil tamil \

> Hmm. I think all of those entries are ones that got added by me while
> absorbing post-2007 Snowball updates, and I confess that I did not think
> about this point. Maybe these should be changed.

After further reflection, I think these are indeed mistakes and we should
change them all. The argument for the Russian/English case, AIUI, is
"if we come across an all-ASCII word, it is most certainly not Russian,
and the most likely Latin-based language is English". Given the world
as it is, I think the same argument works for all non-Latin-alphabet
languages. Obviously specific applications might have a different idea
of the best fallback language, but that's why we let users make their
own text search configurations. For general-purpose use, falling back
to English seems reasonable. And we can be dead certain that applying
a Greek stemmer to an ASCII word will do nothing useful, so the
configuration choice shown above is unhelpful.
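
As a rough illustration of the "make your own configuration" escape hatch,
here is a minimal sketch for a site whose ASCII words are more likely to be
French than English. It assumes the built-in greek configuration and the
french_stem dictionary are installed; the name greek_fr is made up for the
example.

```sql
-- Hypothetical name "greek_fr": start from the stock Greek configuration.
CREATE TEXT SEARCH CONFIGURATION greek_fr (COPY = pg_catalog.greek);

-- Re-route only the all-ASCII token types to the French stemmer.
-- Non-ASCII tokens (word, hword, hword_part) keep their Greek mapping.
ALTER TEXT SEARCH CONFIGURATION greek_fr
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH french_stem;

-- Inspect the result; the exact lexemes depend on the installed dictionaries.
SELECT to_tsvector('greek_fr', 'des chevaux rapides');
```

Only the ASCII token types are touched, which is the same split the default
configurations rely on: one stemmer for all-ASCII words, another for
everything else.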

regards, tom lane
