Re: Flexible configuration for full-text search

From: Aleksandr Parfenov <a(dot)parfenov(at)postgrespro(dot)ru>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Flexible configuration for full-text search
Date: 2018-08-29 08:38:31
Message-ID: 20180829153831.6b66d264@asp437-ThinkPad-L380
Lists: pgsql-hackers

On Tue, 28 Aug 2018 12:40:32 +0700
Aleksandr Parfenov <a(dot)parfenov(at)postgrespro(dot)ru> wrote:

>On Fri, 24 Aug 2018 18:50:38 +0300
>Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>>Agreed, backward compatibility is important here. Probably we should
>>leave old dictionaries for that. But I just meant that if we
>>introduce new (better) way of stop words handling and encourage users
>>to use it, then it would look strange if default configurations work
>>the old way...
>
>I agree with Alexander. The only drawback I see is that after addition
>of new dictionaries, there will be 3 dictionaries for each language:
>old one, stop-word filter for the language, and stemmer dictionary.

While working on the new version of the patch, I found an issue in the
proposed syntax. At the beginning of the conversation, there was a
suggestion to split stop word filtering from word normalization. At this
stage of development, we can use a separate dictionary for stop word
detection, but if we drop the word, the word counter isn't increased
and the stop word is processed as an unknown word.

Currently, I see two solutions:

1) Keep the old way of stop word filtering. The drawback of this
approach is that word normalization and stop word detection logic stay
mixed inside a dictionary. This can be solved by using the 'simple'
dictionary in accept=false mode as a stop word filter.

2) Add an action STOPWORD alongside KEEP and DROP (which is not
implemented in the previous patch, but I think it is good to have both
of them), meaning "increase the word counter but don't add the lexeme
to the vector".
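
To make the difference concrete, here is a rough toy model in Python. It
is not the patch code: the stop word set, the function name
build_vector() and the count_stop_words flag are all made up for
illustration. It only shows how lexeme positions change depending on
whether a filtered stop word still increases the word counter.

```python
# Toy model only; not taken from the patch or from PostgreSQL sources.
# STOP_WORDS, build_vector and count_stop_words are hypothetical names.
STOP_WORDS = {"the", "in"}


def build_vector(words, count_stop_words):
    """Map each kept lexeme to the list of positions where it occurs."""
    vector = {}
    position = 0
    for word in words:
        if word in STOP_WORDS:
            if count_stop_words:
                # STOPWORD-like behaviour: the word takes a position,
                # but no lexeme is added to the vector.
                position += 1
            # Otherwise the word is dropped without touching the
            # counter, as happens for an unknown word.
            continue
        position += 1
        vector.setdefault(word, []).append(position)
    return vector


words = ["the", "cat", "in", "the", "hat"]
dropped = build_vector(words, count_stop_words=False)
counted = build_vector(words, count_stop_words=True)
print(dropped)  # {'cat': [1], 'hat': [2]}
print(counted)  # {'cat': [2], 'hat': [5]}
```

In the first case the distance between "cat" and "hat" collapses from 3
to 1, which, as far as I understand, would change the results of phrase
search operators that rely on positions.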

Any suggestions on the issue?

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
