Flexible configuration for full-text search

From: Aleksandr Parfenov <a(dot)parfenov(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Cc: a(dot)zakirov(at)postgrespro(dot)ru
Subject: Flexible configuration for full-text search
Date: 2017-10-19 14:24:09
Message-ID: 20171019172409.731f52a7@asp437-24-g082ur
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hello hackers,

Arthur Zakirov and I are working on a patch to introduce more flexible
way to configure full-text search in PostgreSQL because current syntax
doesn't allow a variety of scenarios to be handled. Additionally, some
parts contain the implicit logic of the processing, such as filtering
dictionaries with TSL_FILTER flag, so configuration partially moved to
dictionary itself and in most of the cases hardcoded into dictionary.
One more drawback of current FTS configuration is that we can't divide
the dictionary selection and output producing, so we can't configure
FTS to use one dictionary if another one recognized a token (e.g. use
hunspell if dictionary of nouns recognized a token).

Basically, the key goal of the patch is to provide user more
control on processing of the text.

The patch introduces way to configure FTS based on CASE/WHEN/THEN/ELSE
construction. Current comma-separated list also available to meet
compatibility. The basic form of new syntax is following:

ALTER TEXT SEARCH CONFIGURATION <fts_conf>
ALTER MAPPING FOR <token_types> WITH
CASE
WHEN <condition> THEN <command>
....
[ ELSE <command> ]
END;

A condition is a logical expression on dictionaries. You can specify how
to interpret dictionary output with
dictionary IS [ NOT ] NULL - for NULL-result
dictionary IS [ NOT ] STOPWORD - for empty (stopword) result

If interpretation marker is not given it is interpreted as:
dictionary IS NOT NULL AND dictionary IS NOT STOPWORD

A command is an expression on dictionaries output sets with operators
UNION, EXCEPT and INTERSECT. Additionally, there is a special operator
MAP BY which allow us to create the same behavior as with filtering
dictionaries. MAP BY operator get output of the right subexpression and
send it to left subexpression as an input token (if there are more than
one lexeme each one is sent separately).

There is a few example of usage of new configuration and comparison
with solutions using current syntax.

1) Multilingual search. Can be used for FTS on a set of documents in
different languages (example for German and English languages).
ALTER TEXT SEARCH CONFIGURATION multi
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part WITH CASE
WHEN english_hunspell AND german_hunspell THEN
english_hunspell UNION german_hunspell
WHEN english_hunspell THEN english_hunspell
WHEN german_hunspell THEN german_hunspell
ELSE german_stem UNION english_stem
END;

With old configuration we should use separate vector and index for each
required language and query should combine result of search for each
language:
SELECT * FROM en_de_documents WHERE
to_tsvector('english', text) @@ to_tsquery('english', 'query')
OR
to_tsvector('german', text) @@ to_tsquery('german', 'query');

The new multilingual search configuration itself looks more complex but
allow to avoid a split of index and vectors. Additionally, for
similar languages or configurations with simple or *_stem dictionaries
in the list we can reduce total size of index since in
current-state example index for English configuration also will
keep data about documents written in German and vice-versa.

2) Combination of exact search with morphological one. This patch not
fully solve the problem but it is a step toward solution. Currently, we
should split exact and morphological search in query manually and use
separate index for each part. With new way to configure FTS we can use
following configuration:
ALTER TEXT SEARCH CONFIGURATION exact_and_morph
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part WITH CASE
WHEN english_hunspell THEN english_hunspell UNION simple
ELSE english_stem UNION simple
END;

Some of the queries like "'looking' <1> through" where 'looking' is
search for exact form of the word doesn't work in current-state FTS
since we can guarantee that document contains both 'looking' and
through, but can't be sure with distance between them.

Unfortunately, we can't fully support such queries with current format
of tsvector because after processing we can't distinguish is a word
was mentioned in normal form in text or was processed by some
dictionary. This leads to false positive hits if user searches for
the normal form of the word. I think we should provide a user ability
to mark dictionary something like "exact form producer". But without
tsvector modification this mark is useless since we can't mark output
of this dictionary in tsvector.

There is a patch on commitfest which removes 1MB limit on tsvector [1].
There are few free bits available in each lexeme in vector, so
one of the bits may be used for "exact" flag.

3) Using different dictionaries for recognizing and output generation.
As I mentioned before, in new syntax condition and command are separate
and we can use it for some more complex text processing. Here an
example for processing only nouns:
ALTER TEXT SEARCH CONFIGURATION nouns_only
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part WITH CASE
WHEN english_noun THEN english_hunspell
END;

This behavior couldn't be reached with the current state of FTS.

4) Special stopword processing allows us to discard stopwords even if
the main dictionary doesn't support such feature (in example pl_ispell
dictionary keeps stopwords in text):
ALTER TEXT SEARCH CONFIGURATION pl_without_stops
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part WITH CASE
WHEN simple_pl IS NOT STOPWORD THEN pl_ispell
END;

The patch is in attachment. I'm will be glad to hear hackers' opinion
about it.

There are several cases discussed in hackers earlier:

Check for stopwords using non-target dictionary.
https://www.postgresql.org/message-id/4733B65A.9030707@students.mimuw.edu.pl

Support union of outputs of several dictionaries.
https://www.postgresql.org/message-id/c6851b7e-da25-3d8e-a5df-022c395a11b4%40postgrespro.ru

Support of chain of dictionaries using MAP BY operator.
https://www.postgresql.org/message-id/46D57E6F.8020009%40enterprisedb.com

[1] Remove 1MB size limit in tsvector
https://commitfest.postgresql.org/15/1221/

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment Content-Type Size
0001-flexible-fts-configuration.patch text/x-patch 164.5 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Geoff Winkless 2017-10-19 14:31:24 Re: Cursor With_Hold Performance Workarounds/Optimization
Previous Message Robert Haas 2017-10-19 14:21:12 Re: [COMMITTERS] pgsql: Fix traversal of half-frozen update chains