Re: Flexible configuration for full-text search

From: Aleksandr Parfenov <a(dot)parfenov(at)postgrespro(dot)ru>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Aleksander Alekseev <a(dot)alekseev(at)postgrespro(dot)ru>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Flexible configuration for full-text search
Date: 2018-04-06 07:51:38
Message-ID: 20180406105138.72ed468c@asp437-manjaro
Lists: pgsql-hackers

On Thu, 5 Apr 2018 17:26:10 +0300
Teodor Sigaev <teodor(at)sigaev(dot)ru> wrote:
> Some notices:
>
> 0) patch conflicts with last changes in gram.y, conflicts are trivial.

Yes, the commits that added the MERGE command changed gram.y, which
caused some conflicts.

> 2) pg_ts_config_map.h, "jsonb mapdicts" isn't decorated with
> #ifdef CATALOG_VARLEN like other varlena columns in catalog. If it's
> right, please explain and add a comment.

Since there is only one varlena column, it is safe to use it directly. I
added a comment explaining this.

> 3) I see changes in pg_catalog, including drop column, change its
> type, change index, change function etc. Did you pay attention to
> pg_upgrade? I don't see it in patch.

The full-text search configuration is migrated through FTS commands such
as CREATE TEXT SEARCH CONFIGURATION. pg_upgrade uses pg_dump to dump
this part of the catalog, and dictionary_mapping_to_text is used there
to get a textual representation of the FTS configuration. Correct me if
I'm wrong.

> 4) Seems, it could work:
> ALTER TEXT SEARCH CONFIGURATION russian
> ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
> word, hword, hword_part
> WITH english_stem union (russian_stem, simple);
> ^^^^^^^^^^^^^^^^^^^^^ simple way
> instead of WITH english_stem union (case russian_stem when match then
> keep else simple end);

I added this ability, since it required only a small fix in the grammar.
I also added tests for this kind of configuration. The test is a bit
synthetic because I used a synonym dictionary as the one that doesn't
accept some input.
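
To make the equivalence of the two spellings concrete, here is a toy
Python model (not code from the patch). Everything in it is an
assumption made for illustration: dictionaries are plain functions that
return a list of lexemes or None for "no match", the toy_* dictionaries
are hypothetical stand-ins, and the handling of a NULL operand in union
is my guess at reasonable behaviour rather than something the patch
guarantees.

```python
from typing import Callable, List, Optional

Lexemes = Optional[List[str]]            # None models "no match" (NULL)
Dictionary = Callable[[str], Lexemes]


def fallback(first: Dictionary, second: Dictionary) -> Dictionary:
    """Model of `(first, second)`, read as
    CASE first WHEN MATCH THEN KEEP ELSE second END."""
    def run(token: str) -> Lexemes:
        result = first(token)
        return result if result is not None else second(token)
    return run


def union(left: Dictionary, right: Dictionary) -> Dictionary:
    """Model of `left UNION right`: merge both results, dropping duplicates.
    Assumption: a NULL operand contributes nothing to the result."""
    def run(token: str) -> Lexemes:
        l, r = left(token), right(token)
        if l is None and r is None:
            return None
        merged: List[str] = []
        for lexeme in (l or []) + (r or []):
            if lexeme not in merged:
                merged.append(lexeme)
        return merged
    return run


# Hypothetical stand-ins for real dictionaries, kept deliberately trivial.
def toy_english_stem(token: str) -> Lexemes:
    return [token[:-1]] if token.endswith("s") else None


def toy_russian_stem(token: str) -> Lexemes:
    return [token[:-1]] if token.endswith("ы") else None


def toy_simple(token: str) -> Lexemes:
    return [token.lower()]


# english_stem union (russian_stem, simple)
mapping = union(toy_english_stem, fallback(toy_russian_stem, toy_simple))
```

In this model the comma list is just sugar for the CASE form: simple is
consulted only when russian_stem returns no match, and the outcome is
then merged with whatever english_stem produced.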

> 4) Initial approach suggested to distinguish three states of
> dictionary result: null (unknown word), stopword and usual word. Now
> only two, we lost the possibility to catch stopwords. One way to use
> stopwords is: suppose we have two identical FTS configurations, except
> one skips stopwords and another doesn't. Second configuration is used
> for indexing, and first one for search by default. But if we can't find
> anything ('to be or to be' - phrase contains stopwords only) then we
> can use second configuration. For now, we need to keep two variants of
> each dictionary - with and without stopwords. But if it's possible to
> distinguish stop and nonstop words in configuration then we don't
> need to have duplicated dictionaries.

With the proposed configuration syntax, it is possible to create a
special dictionary used only for stopword checking and consult it at
decision-making time.

For example, we can create a dictionary english_stopword which returns
the word itself if it is a stopword and NULL otherwise. With such a
dictionary we can create a configuration:

ALTER TEXT SEARCH CONFIGURATION test_cfg
    ALTER MAPPING FOR asciiword, word
    WITH CASE english_stopword WHEN NO MATCH THEN english_hunspell END;

In this example, english_hunspell can be implemented without any
stopword processing at all, so stopword handling and the processing of
other words are split into separate dictionaries.
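
As a rough illustration of that split, here is a small Python model of
the mapping above (again a sketch, not code from the patch). The
stopword list, toy_hunspell and the helper names are all hypothetical,
and I am assuming that a CASE without an ELSE branch produces no
lexemes when english_stopword does match.

```python
from typing import Callable, List, Optional

Lexemes = Optional[List[str]]             # None models "no match" (NULL)

STOPWORDS = {"to", "be", "or", "not"}     # toy list, not the real english.stop


def english_stopword(token: str) -> Lexemes:
    """Return the word itself for a stopword and None (NULL) otherwise."""
    word = token.lower()
    return [word] if word in STOPWORDS else None


def toy_hunspell(token: str) -> Lexemes:
    """Hypothetical stand-in for english_hunspell; it knows nothing
    about stopwords and just lowercases its input."""
    return [token.lower()]


def index_mapping(token: str) -> Lexemes:
    """CASE english_stopword WHEN NO MATCH THEN english_hunspell END.
    Assumption: with no ELSE branch, a matched stopword yields nothing."""
    if english_stopword(token) is None:
        return toy_hunspell(token)
    return []


def to_lexemes(mapping: Callable[[str], Lexemes], text: str) -> List[str]:
    """Run every whitespace-separated token through the mapping."""
    result: List[str] = []
    for token in text.split():
        result.extend(mapping(token) or [])
    return result
```

Under these assumptions, to_lexemes(index_mapping, "to search or index")
keeps only "search" and "index", while passing toy_hunspell directly as
the mapping keeps every word. The second configuration from the use case
above therefore needs no duplicated dictionary, only a different
mapping.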

The key point of the patch is that the PostgreSQL internals process
stopwords the same way as other words, while configurations give users
a tool to handle them in a special way.

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment Content-Type Size
0001-flexible-fts-configuration-v11.patch text/x-patch 177.3 KB
