[PROPOSAL] Text search configuration extension

From: Aleksandr Parfenov <a(dot)parfenov(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [PROPOSAL] Text search configuration extension
Date: 2017-08-18 12:30:38
Message-ID: 20170818153038.6e40e876@asp437-24-g082ur
Lists: pgsql-hackers

Hello hackers!

I'm working on a new approach to text search configuration and want to
share my thoughts with the community in order to get some feedback and
maybe some new ideas.

Currently we can't configure the text search engine in Postgres for some
useful scenarios, such as multi-language search or exact and
morphological search in one configuration. Additionally, we can't use a
dictionary as a filter dictionary unless that was taken into
consideration during the dictionary's development. I would also like to
separate the configuration of result set building from the configuration
of command selection. Last but not least, a goal is to keep backward
compatibility in terms of syntax and behavior in currently available
scenarios.

In order to meet the mentioned goals, I propose the following syntax for
text search configurations (the current syntax could be used as well):

ALTER TEXT SEARCH CONFIGURATION <configuration> ADD/ALTER MAPPING FOR
<token_list> WITH
CASE
WHEN <condition> THEN <command>
<...>
[ELSE <command>]
END;

A <condition> is an expression with dictionary names as operands and
the boolean operators AND, OR and NOT. Additionally, a dictionary name
may be followed by a check of its result: IS [NOT] NULL or IS [NOT]
STOP. If no check is given for a dictionary, it is evaluated as:
dict IS NOT NULL and dict IS NOT STOP
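
To make the default check concrete, here is a minimal Python sketch. It
is not part of any patch. It assumes a dictionary result is modelled as
None for NULL (token not recognized), an empty list for a stop word, and
a non-empty list of lexemes otherwise. The helper names is_null, is_stop
and default_check are hypothetical, and treating a NULL result as
"not a stop word" is an assumption of the sketch.

```python
from typing import List, Optional

# Modelling assumption: None = NULL (token not recognized),
# [] = stop word, non-empty list = lexemes produced by the dictionary.
DictResult = Optional[List[str]]


def is_null(result: DictResult) -> bool:
    """Models `dict IS NULL`."""
    return result is None


def is_stop(result: DictResult) -> bool:
    """Models `dict IS STOP`: the token is known but is a stop word."""
    return result is not None and len(result) == 0


def default_check(result: DictResult) -> bool:
    """Models a bare dictionary name used as a condition operand:
    dict IS NOT NULL and dict IS NOT STOP."""
    return not is_null(result) and not is_stop(result)
```

Under this model a bare dictionary name is true only when the dictionary
returned at least one lexeme.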

A <command> is an expression on sets of lexemes with support for the
operators UNION, EXCEPT, INTERSECT and MAP BY. The MAP BY operator is a
way to configure filter dictionaries: the output of the right-hand
subexpression is used as the input of the left-hand subexpression. In
other words, the MAP BY operator is used instead of TSL_FILTER-flagged
output.
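
As a rough illustration of the three set operators, the following Python
sketch treats each operand as a set of lexemes. The function names are
hypothetical, and treating a NULL operand (None) as an empty set is an
assumption of the sketch rather than something the proposal specifies.
MAP BY is not modelled here.

```python
from typing import List, Optional, Set

DictResult = Optional[List[str]]


def as_set(result: DictResult) -> Set[str]:
    """Assumption: a NULL result contributes no lexemes."""
    return set() if result is None else set(result)


def union(a: DictResult, b: DictResult) -> Set[str]:
    """Models `a UNION b`: lexemes produced by either operand."""
    return as_set(a) | as_set(b)


def intersect(a: DictResult, b: DictResult) -> Set[str]:
    """Models `a INTERSECT b`: lexemes produced by both operands."""
    return as_set(a) & as_set(b)


def except_(a: DictResult, b: DictResult) -> Set[str]:
    """Models `a EXCEPT b`: lexemes of a that b did not produce."""
    return as_set(a) - as_set(b)
```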

An example of a configuration for both English and German search:

ALTER TEXT SEARCH CONFIGURATION en_de_search ADD MAPPING FOR asciiword,
word WITH
CASE
WHEN english_hunspell IS NOT NULL THEN english_hunspell
WHEN german_hunspell IS NOT NULL THEN german_hunspell
ELSE
-- stem dictionaries can't be used for language detection
english_stem UNION german_stem
END;
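
The way this CASE would pick a branch can be sketched in Python. The
dictionaries below are toy stand-ins with made-up contents, not the real
hunspell or snowball dictionaries, and en_de_search is a hypothetical
function name. None models NULL and an empty list models a stop word.

```python
from typing import Dict, List, Optional, Set

# Toy lookup tables standing in for the hunspell dictionaries.
ENGLISH_HUNSPELL: Dict[str, List[str]] = {"cats": ["cat"], "the": []}
GERMAN_HUNSPELL: Dict[str, List[str]] = {"kinder": ["kind"]}


def english_stem(token: str) -> List[str]:
    """Toy stemmer: strips one trailing 's'."""
    if token.endswith("s") and len(token) > 1:
        return [token[:-1]]
    return [token]


def german_stem(token: str) -> List[str]:
    """Toy stemmer: strips a trailing 'en'."""
    if token.endswith("en") and len(token) > 2:
        return [token[:-2]]
    return [token]


def en_de_search(token: str) -> Set[str]:
    # WHEN english_hunspell IS NOT NULL THEN english_hunspell
    en: Optional[List[str]] = ENGLISH_HUNSPELL.get(token)
    if en is not None:
        return set(en)
    # WHEN german_hunspell IS NOT NULL THEN german_hunspell
    de: Optional[List[str]] = GERMAN_HUNSPELL.get(token)
    if de is not None:
        return set(de)
    # ELSE english_stem UNION german_stem
    return set(english_stem(token)) | set(german_stem(token))
```

Note that the explicit IS NOT NULL check lets a stop word through: in
this model "the" matches the first branch and yields an empty set,
instead of falling through to the stemmers.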

And an example with unaccent:

ALTER TEXT SEARCH CONFIGURATION german_unaccent ADD MAPPING FOR
asciiword, word WITH
CASE
WHEN german_hunspell IS NOT NULL THEN german_hunspell MAP BY unaccent
ELSE
german_stem MAP BY unaccent
END;

In the last example, the input for german_hunspell is replaced by the
output of unaccent if that output is not NULL. If a dictionary returns
more than one lexeme, each lexeme is processed independently.
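
A minimal Python sketch of this MAP BY behavior follows. The
dictionaries are toy stand-ins with made-up contents, and map_by is a
hypothetical helper. The toy unaccent returns None (NULL) when it
changes nothing. Two details are assumptions of the sketch, since the
proposal does not spell them out: the outputs for the individual lexemes
are concatenated, and the result is NULL when the left-hand dictionary
recognizes none of its inputs (including the case of an empty filter
output).

```python
from typing import Callable, Dict, List, Optional

DictResult = Optional[List[str]]
Dictionary = Callable[[str], DictResult]

_ACCENTS = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})
_HUNSPELL: Dict[str, List[str]] = {"bucher": ["buch"], "haus": ["haus"]}


def unaccent(token: str) -> DictResult:
    """Toy filter: returns the stripped token, or None if unchanged."""
    stripped = token.translate(_ACCENTS)
    return [stripped] if stripped != token else None


def german_hunspell(token: str) -> DictResult:
    """Toy dictionary lookup; None models NULL (token not recognized)."""
    return _HUNSPELL.get(token)


def map_by(left: Dictionary, right: Dictionary, token: str) -> DictResult:
    """Models `left MAP BY right` for a single input token."""
    mapped = right(token)
    # If the right-hand side returned NULL, left sees the original token;
    # otherwise left sees each lexeme of the right-hand output in turn.
    inputs = [token] if mapped is None else mapped
    collected: List[str] = []
    recognized = False
    for lexeme in inputs:
        out = left(lexeme)
        if out is not None:
            recognized = True
            collected.extend(out)
    return collected if recognized else None
```

With these toy tables, "bücher" is first rewritten to "bucher" by the
filter and then resolved by the dictionary, while "haus" passes through
the filter unchanged and is looked up directly.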

I'm not sure whether we should provide the ability to use the MAP BY
operator in a condition, since MAP BY operates on sets and a condition is
a boolean expression. I think we could allow this with the restriction
that it must be placed inside parentheses and followed by a check.
Something like:

(german_hunspell MAP BY unaccent) IS NOT NULL

This type of check can be useful in some situations, but we should
isolate the set-related subexpression.

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
