Re: Flexible configuration for full-text search

From: Aleksandr Parfenov <a(dot)parfenov(at)postgrespro(dot)ru>
To: Emre Hasegeli <emre(at)hasegeli(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
Subject: Re: Flexible configuration for full-text search
Date: 2017-10-30 12:40:32
Message-ID: 20171030154032.5447672c@asp437-24-g082ur
Lists: pgsql-hackers

I'm mostly happy with the proposed modifications, but I have a few
questions to clarify some points. I will send a new patch in a week or two.

On Thu, 26 Oct 2017 20:01:14 +0200
Emre Hasegeli <emre(at)hasegeli(dot)com> wrote:
> To put it formally:
>
> ALTER TEXT SEARCH CONFIGURATION name
> ADD MAPPING FOR token_type [, ... ] WITH config
>
> where config is one of:
>
> dictionary_name
> config { UNION | INTERSECT | EXCEPT } config
> CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END

According to the formal definition, the following configurations are valid:

CASE english_hunspell WHEN MATCH THEN KEEP ELSE simple END
CASE english_noun WHEN MATCH THEN english_hunspell END

But the configuration:

CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END

is not (as I understand it, ELSE can be used only with KEEP).

I think we should decide whether to allow different dictionaries for
match checking (between CASE and WHEN) and for the result (after THEN).
If the answer is 'allow', maybe we should allow the third example too,
for consistency.
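
To make the grammar question concrete, here is a minimal Python sketch
(not part of the patch) that checks a flat CASE expression against the
formal definition quoted above. The name is_valid_case and the
restriction to plain lowercase dictionary names without nesting are
assumptions made for illustration only; the real rule is recursive.

```python
import re

# Flat form of the quoted rule, with plain dictionary names only (no nesting):
#   CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END
NAME = r"[a-z_][a-z0-9_]*"
CASE_RE = re.compile(
    rf"CASE\s+{NAME}\s+WHEN\s+(?:NO\s+)?MATCH\s+THEN\s+(?:KEEP\s+ELSE\s+)?{NAME}\s+END"
)

def is_valid_case(expr: str) -> bool:
    """Return True if expr matches the flat form of the quoted CASE rule."""
    return CASE_RE.fullmatch(expr.strip()) is not None

examples = [
    "CASE english_hunspell WHEN MATCH THEN KEEP ELSE simple END",
    "CASE english_noun WHEN MATCH THEN english_hunspell END",
    "CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END",
]
for e in examples:
    print(is_valid_case(e), e)
```

Under this reading the first two expressions are accepted and the third
is rejected, because ELSE is only reachable through the optional
"KEEP ELSE" part of the rule.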

> > 3) Using different dictionaries for recognizing and output
> > generation. As I mentioned before, in new syntax condition and
> > command are separate and we can use it for some more complex text
> > processing. Here an example for processing only nouns:
> >
> > ALTER TEXT SEARCH CONFIGURATION nouns_only
> > ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
> > word, hword, hword_part WITH CASE
> > WHEN english_noun THEN english_hunspell
> > END
>
> This would also still work with the simpler syntax because
> "english_noun", still being a dictionary, would pass the tokens to the
> next one.

Based on the formal definition, it is possible to describe this example
in the following manner:
CASE english_noun WHEN MATCH THEN english_hunspell END

The question is the same as in the previous example.

> Instead of supporting old way of putting stopwords on dictionaries, we
> can make them dictionaries on their own. This would then become
> something like:
>
> CASE polish_stopword
> WHEN NO MATCH THEN polish_isspell
> END

Currently, stopwords increment position, for example:
SELECT to_tsvector('english','a test message');
---------------------
'messag':3 'test':2

The stopword 'a' occupies position 1, but it is not in the vector.

If we want to preserve this behavior, we should somehow pass the
stopword to the tsvector composition function (parsetext in ts_parse.c)
so that it increments the counter, or increment it in another way.
Currently, an empty lexeme array is passed as the result of LexizeExec.
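
The position behavior can be modeled with a small Python sketch. This is
a toy model, not the code in ts_parse.c: toy_lexize, toy_tsvector,
STOPWORDS and STEMS are hypothetical stand-ins, with an empty list
playing the role of the empty lexeme array returned for a stopword.

```python
# Toy stand-ins (hypothetical): a stopword list and a tiny stem table.
STOPWORDS = {"a", "the"}
STEMS = {"test": "test", "message": "messag"}

def toy_lexize(token):
    """Return [] for a stopword, a list of lexemes for a known word, else None."""
    if token in STOPWORDS:
        return []                      # recognized, but yields no lexemes
    if token in STEMS:
        return [STEMS[token]]
    return None                        # token not recognized

def toy_tsvector(text):
    """Map each lexeme to its positions; every recognized token advances the counter."""
    vector = {}
    pos = 0
    for token in text.lower().split():
        lexemes = toy_lexize(token)
        if lexemes is None:
            continue                   # unrecognized tokens are skipped in this toy model
        pos += 1                       # stopwords ([]) still consume a position
        for lexeme in lexemes:
            vector.setdefault(lexeme, []).append(pos)
    return vector

print(toy_tsvector("a test message"))
```

In this model 'test' lands at position 2 and 'messag' at position 3,
because the empty result for 'a' still advances the counter.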

One possible way to do so is something like:
CASE polish_stopword
WHEN MATCH THEN KEEP -- stopword counting
ELSE polish_isspell
END
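
A sketch of how such a configuration might be evaluated, in Python. It
assumes semantics the thread has not settled: MATCH means the dictionary
between CASE and WHEN returned a non-NULL result (an empty array counts
as a match), and KEEP returns that result unchanged. The functions
polish_stopword and polish_isspell below are toy stand-ins, not real
dictionaries, and case_match_keep_else is a hypothetical helper.

```python
def polish_stopword(token):
    """Toy stopword dictionary: [] for a stopword, None (no match) otherwise."""
    return [] if token in {"i", "w", "na"} else None

def polish_isspell(token):
    """Toy output dictionary: returns the lowercased token as its only lexeme."""
    return [token.lower()]

def case_match_keep_else(condition, otherwise, token):
    """Evaluate CASE condition WHEN MATCH THEN KEEP ELSE otherwise END for one token."""
    result = condition(token)
    if result is not None:             # MATCH: assumed to mean a non-NULL result
        return result                  # KEEP: pass the condition's output through
    return otherwise(token)            # ELSE branch

for token in ["kot", "na", "stole"]:
    print(token, case_match_keep_else(polish_stopword, polish_isspell, token))
```

Under these assumptions the stopword 'na' yields an empty list rather
than no result at all, so the caller could still advance the position
counter for it.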

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
