Re: Flexible configuration for full-text search

From: Emre Hasegeli <emre(at)hasegeli(dot)com>
To: Aleksandr Parfenov <a(dot)parfenov(at)postgrespro(dot)ru>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
Subject: Re: Flexible configuration for full-text search
Date: 2017-10-31 08:47:57
Message-ID: CAE2gYzyHtn6OF5LnKptRRodWLkOvsepnN9YUgmLRpMTVuw0mzA@mail.gmail.com
Lists: pgsql-hackers

> I'm mostly happy with mentioned modifications, but I have few questions
> to clarify some points. I will send new patch in week or two.

I am glad you liked it. However, I think we should get approval on the
syntax from more senior community members or committers before we put
more effort into the code.

> But configuration:
>
> CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END
>
> is not (as I understand ELSE can be used only with KEEP).
>
> I think we should decide to allow or disallow usage of different
> dictionaries for match checking (between CASE and WHEN) and a result
> (after THEN). If answer is 'allow', maybe we should allow the
> third example too for consistency in configurations.

I think you are right. We had better allow this too. The CASE syntax
then becomes:

CASE config
WHEN [ NO ] MATCH THEN { KEEP | config }
[ ELSE config ]
END
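
To make the grammar above concrete, here is a rough Python model of how
such a CASE could be evaluated. It is only a sketch under assumptions the
thread has not settled: a "config" is modelled as a function returning a
list of lexemes or None, "MATCH" is taken to mean a non-None result, and
when no branch applies and there is no ELSE, the condition's own output
is passed through. The names make_case and KEEP and the toy dictionaries
are hypothetical, not part of the patch.

```python
from typing import Callable, List, Optional

# None: token not recognized; a list (possibly empty): token recognized.
Lexemes = Optional[List[str]]
Config = Callable[[str], Lexemes]

KEEP = object()  # hypothetical sentinel standing in for the KEEP keyword


def make_case(cond: Config, then, no: bool = False,
              otherwise: Optional[Config] = None) -> Config:
    """Model of: CASE cond WHEN [NO] MATCH THEN {KEEP | then} [ELSE otherwise] END."""
    def lexize(token: str) -> Lexemes:
        result = cond(token)
        matched = result is not None      # assumption: match == recognized
        if matched != no:                 # the WHEN branch applies
            return result if then is KEEP else then(token)
        if otherwise is not None:         # the ELSE branch applies
            return otherwise(token)
        return result                     # assumption: pass cond's output through
    return lexize


# Toy stand-ins for real dictionaries (purely illustrative).
def english_noun(token: str) -> Lexemes:
    return [token] if token in {"cats", "dog"} else None


def english_hunspell(token: str) -> Lexemes:
    return [token.rstrip("s")]


def simple(token: str) -> Lexemes:
    return [token.lower()]


# CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END
config = make_case(english_noun, english_hunspell, otherwise=simple)
print(config("cats"))     # ['cat']
print(config("Running"))  # ['running']
```

In this model the dictionary between CASE and WHEN only decides which
branch runs, while the one after THEN produces the result, which is the
"allow" answer to the question above.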

> Based on formal definition it is possible to describe this example in
> following manner:
> CASE english_noun WHEN MATCH THEN english_hunspell END
>
> The question is same as in the previous example.

I couldn't understand the question.

> Currently, stopwords increment position, for example:
> SELECT to_tsvector('english','a test message');
> ---------------------
> 'messag':3 'test':2
>
> A stopword 'a' has a position 1 but it is not in the vector.

Does this problem apply only to stopwords, or to the whole thing we are
inventing? Shouldn't we preserve the positions through the pipeline?

> If we want to save this behavior, we should somehow pass a stopword to
> tsvector composition function (parsetext in ts_parse.c) for counter
> increment or increment it in another way. Currently, an empty lexemes
> array is passed as a result of LexizeExec.
>
> One of possible way to do so is something like:
> CASE polish_stopword
> WHEN MATCH THEN KEEP -- stopword counting
> ELSE polish_isspell
> END

This would mean keeping the stopwords. What we want is:

CASE polish_stopword -- stopword counting
WHEN NO MATCH THEN polish_isspell
END

Do you think it is possible?
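
As a rough illustration of the intent, the Python sketch below models
stopword counting with the NO MATCH form. It rests on assumptions, not on
the patch: a stopword dictionary returns an empty list for a stopword and
None otherwise, a matched condition without an ELSE passes its (empty)
output through, and every token advances the position counter. All
function names and the toy stemmer are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional

# None: token not recognized; []: recognized stopword; otherwise lexemes.
Lexemes = Optional[List[str]]
Config = Callable[[str], Lexemes]


def stopword_dict(words: Iterable[str]) -> Config:
    """Hypothetical stand-in for a stopword dictionary."""
    stop = set(words)

    def lexize(token: str) -> Lexemes:
        return [] if token in stop else None
    return lexize


def toy_stemmer(token: str) -> Lexemes:
    """Hypothetical stand-in for a real stemming dictionary."""
    return [token[:-1]] if token.endswith("e") else [token]


def case_no_match(cond: Config, then: Config) -> Config:
    """Model of: CASE cond WHEN NO MATCH THEN then END."""
    def lexize(token: str) -> Lexemes:
        result = cond(token)
        # Assumption: on a match, cond's own (empty) output is the result,
        # so the stopword is dropped but still seen by the caller.
        return then(token) if result is None else result
    return lexize


def build_vector(tokens: List[str], lexize: Config) -> Dict[str, List[int]]:
    """Assign positions to every token, emitting only non-empty results."""
    vector: Dict[str, List[int]] = {}
    for pos, token in enumerate(tokens, start=1):
        for lexeme in lexize(token) or []:
            vector.setdefault(lexeme, []).append(pos)
    return vector


config = case_no_match(stopword_dict({"a"}), toy_stemmer)
vector = build_vector(["a", "test", "message"], config)
print(vector)  # {'test': [2], 'messag': [3]}
```

The stopword 'a' takes position 1 without appearing in the result, which
mirrors the to_tsvector output quoted above.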
