| From: | Aleksandr Parfenov <a(dot)parfenov(at)postgrespro(dot)ru> | 
|---|---|
| To: | Emre Hasegeli <emre(at)hasegeli(dot)com> | 
| Cc: | "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru> | 
| Subject: | Re: Flexible configuration for full-text search | 
| Date: | 2017-10-30 12:40:32 | 
| Message-ID: | 20171030154032.5447672c@asp437-24-g082ur | 
| Lists: | pgsql-hackers | 
I'm mostly happy with the proposed modifications, but I have a few
questions to clarify some points. I will send a new patch in a week or two.
On Thu, 26 Oct 2017 20:01:14 +0200
Emre Hasegeli <emre(at)hasegeli(dot)com> wrote:
> To put it formally:
> 
> ALTER TEXT SEARCH CONFIGURATION name
>     ADD MAPPING FOR token_type [, ... ] WITH config
> 
> where config is one of:
> 
>     dictionary_name
>     config { UNION | INTERSECT | EXCEPT } config
>     CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END
According to the formal definition, the following configurations are valid:
CASE english_hunspell WHEN MATCH THEN KEEP ELSE simple END
CASE english_noun WHEN MATCH THEN english_hunspell END
But the configuration:
CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END
is not (as I understand it, ELSE can be used only with KEEP).
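The reading of the grammar above can be checked mechanically. What follows
is only a minimal sketch of a recognizer for the quoted formal definition,
not code from the patch. The function names (parse_config, parse_term,
is_valid) are hypothetical, and it assumes whitespace-separated tokens with
upper-case keywords.

```python
# Sketch of a recognizer for the grammar quoted above (not the patch's parser):
#   config := dictionary_name
#           | config { UNION | INTERSECT | EXCEPT } config
#           | CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END
SET_OPS = {"UNION", "INTERSECT", "EXCEPT"}
KEYWORDS = SET_OPS | {"CASE", "WHEN", "NO", "MATCH", "THEN", "KEEP", "ELSE", "END"}


class ParseError(Exception):
    pass


def expect(tokens, pos, word):
    """Require `word` at tokens[pos]; return the position after it."""
    if pos >= len(tokens) or tokens[pos] != word:
        raise ParseError(f"expected {word} at token {pos}")
    return pos + 1


def parse_term(tokens, pos):
    """Parse a dictionary name or a CASE ... END expression."""
    if pos >= len(tokens):
        raise ParseError("unexpected end of input")
    if tokens[pos] == "CASE":
        pos = parse_config(tokens, pos + 1)
        pos = expect(tokens, pos, "WHEN")
        if pos < len(tokens) and tokens[pos] == "NO":
            pos += 1
        pos = expect(tokens, pos, "MATCH")
        pos = expect(tokens, pos, "THEN")
        # ELSE is only reachable through the optional "KEEP ELSE" pair.
        if pos < len(tokens) and tokens[pos] == "KEEP":
            pos = expect(tokens, pos + 1, "ELSE")
        pos = parse_config(tokens, pos)
        return expect(tokens, pos, "END")
    if tokens[pos] in KEYWORDS:
        raise ParseError(f"unexpected keyword {tokens[pos]} at token {pos}")
    return pos + 1  # a plain dictionary name


def parse_config(tokens, pos):
    """Parse terms joined by UNION / INTERSECT / EXCEPT."""
    pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in SET_OPS:
        pos = parse_term(tokens, pos + 1)
    return pos


def is_valid(text):
    """Return True if the whole text is one config under the grammar."""
    tokens = text.split()
    try:
        return parse_config(tokens, 0) == len(tokens)
    except ParseError:
        return False
```

Under these assumptions, is_valid() accepts the first two configurations
and rejects the third, because after "THEN english_hunspell" the grammar
expects END rather than ELSE.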
I think we should decide whether to allow or disallow using different
dictionaries for match checking (between CASE and WHEN) and for the result
(after THEN). If the answer is 'allow', maybe we should allow the
third example too, for consistency in configurations.
> > 3) Using different dictionaries for recognizing and output
> > generation. As I mentioned before, in new syntax condition and
> > command are separate and we can use it for some more complex text
> > processing. Here an example for processing only nouns:
> >
> > ALTER TEXT SEARCH CONFIGURATION nouns_only
> >   ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
> >                     word, hword, hword_part WITH CASE
> >   WHEN english_noun THEN english_hunspell
> > END  
> 
> This would also still work with the simpler syntax because
> "english_noun", still being a dictionary, would pass the tokens to the
> next one.
Based on the formal definition, this example can be described in the
following manner:
CASE english_noun WHEN MATCH THEN english_hunspell END
The question is the same as in the previous example.
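To make the "different dictionaries for matching and for the result" case
concrete, here is a toy model of how such a mapping might be evaluated. It
is a sketch under assumptions, not the patch's behavior: the two dictionary
functions and their word lists are made up, case_when_match is a
hypothetical helper, and a return value of None stands in for a dictionary
that does not recognize the token.

```python
# Toy model of: CASE english_noun WHEN MATCH THEN english_hunspell END
# The dictionaries below are hypothetical stand-ins with made-up word lists.

def english_noun(token):
    """Pretend noun dictionary: recognizes only a tiny fixed set of nouns."""
    return [token] if token in {"cats", "message"} else None


def english_hunspell(token):
    """Pretend normalizing dictionary backed by a small lookup table."""
    table = {"cats": ["cat"], "message": ["message"], "running": ["run"]}
    return table.get(token)


def case_when_match(condition, result):
    """Build a mapping that uses `condition` only to decide whether a token
    matches and `result` only to produce the output lexemes."""
    def mapping(token):
        if condition(token) is None:
            return None  # no match and no ELSE branch: token is not handled
        return result(token)
    return mapping


nouns_only = case_when_match(english_noun, english_hunspell)
```

In this model nouns_only("cats") yields ["cat"], while nouns_only("running")
yields None even though english_hunspell knows the word, because the
condition dictionary did not match.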
> Instead of supporting old way of putting stopwords on dictionaries, we
> can make them dictionaries on their own.  This would then become
> something like:
> 
>     CASE polish_stopword
>         WHEN NO MATCH THEN polish_isspell
>     END
Currently, stopwords increment the position counter, for example:
SELECT to_tsvector('english','a test message');
---------------------
 'messag':3 'test':2
The stopword 'a' has position 1, but it is not in the vector.
If we want to preserve this behavior, we should somehow pass a stopword
to the tsvector composition function (parsetext in ts_parse.c) so that it
increments the counter, or increment it in another way. Currently, an
empty lexemes array is passed as a result of LexizeExec.
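The effect on positions can be illustrated with a small simulation. This is
only a sketch, not parsetext itself: the stopword set, the to_positions
function and its count_stopwords flag are hypothetical, and no stemming is
done, so 'message' stays 'message' instead of becoming 'messag'.

```python
# Simulation of position counting with and without stopwords being counted.
STOPWORDS = {"a", "the", "is"}  # hypothetical stopword list


def to_positions(text, count_stopwords=True):
    """Map each non-stopword token to the list of positions where it occurs."""
    vector = {}
    position = 0
    for token in text.lower().split():
        if token in STOPWORDS:
            if count_stopwords:
                position += 1  # stopword occupies a position but is not stored
            continue
        position += 1
        vector.setdefault(token, []).append(position)
    return vector
```

With count_stopwords=True, to_positions('a test message') places 'test' at
2 and 'message' at 3, as in the to_tsvector output above. With
count_stopwords=False, which models a stopword being dropped before the
counter sees it, the positions shift to 1 and 2.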
One possible way to do so is something like:
CASE polish_stopword
    WHEN MATCH THEN KEEP -- stopword counting
    ELSE polish_isspell
END
-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company