Re: new function for tsquery creartion

From: Aleksandr Parfenov <a(dot)parfenov(at)postgrespro(dot)ru>
To: Dmitry Ivanov <d(dot)ivanov(at)postgrespro(dot)ru>
Cc: Aleksander Alekseev <a(dot)alekseev(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: new function for tsquery creartion
Date: 2018-04-03 14:13:20
Message-ID: 20180403171320.400cd24a@asp437-24-g082ur
Lists: pgsql-hackers

On Tue, 03 Apr 2018 14:28:37 +0300
Dmitry Ivanov <d(dot)ivanov(at)postgrespro(dot)ru> wrote:
> I'm sorry, I totally forgot to fix a few more things, the patch is
> attached below.

The patch looks good to me except for two things.

I'm not sure about the different result for these queries:
SELECT websearch_to_tsquery('simple', 'cat or ');
websearch_to_tsquery
----------------------
'cat'
(1 row)
SELECT websearch_to_tsquery('simple', 'cat or');
websearch_to_tsquery
----------------------
'cat' & 'or'
(1 row)

But I don't have a strong opinion about these queries, since the input
in both of them looks broken in terms of operator usage.

I found odd behavior in the query creation function in this case:
SELECT websearch_to_tsquery('english', '"pg_class pg"');
websearch_to_tsquery
-----------------------------
( 'pg' & 'class' ) <-> 'pg'
(1 row)

This query means that lexemes 'pg' and 'class' should be at the same
distance from the last 'pg'. In other words, they should have the same
position. But the default parser interprets pg_class as two separate
words during text parsing/vector creation, so they never share a
position.
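
To make the position argument concrete, here is a toy model in Python
(not PostgreSQL code). It assumes the text 'pg_class pg' yields the
positions pg=1, class=2, pg=3, and it follows the reading of the query
given above. The helper names are made up for illustration.

```python
# Toy model of lexeme positions; this is not PostgreSQL code.
# Assumed positions for the text 'pg_class pg' when the parser splits
# pg_class into two words: pg=1, class=2, pg=3.
positions = {"pg": {1, 3}, "class": {2}}


def and_then_follows(left_lexemes, right_lexeme, positions):
    """Model (l1 & l2 & ...) <-> r: all left lexemes must share one
    position p, and right_lexeme must occur at p + 1."""
    shared = set.intersection(
        *(positions.get(lex, set()) for lex in left_lexemes)
    )
    return any(p + 1 in positions.get(right_lexeme, set()) for p in shared)


def phrase(lexemes, positions):
    """Model l1 <-> l2 <-> ...: lexemes occur at consecutive positions."""
    return any(
        all(start + i in positions.get(lex, set())
            for i, lex in enumerate(lexemes))
        for start in positions.get(lexemes[0], set())
    )


# ( 'pg' & 'class' ) <-> 'pg': 'pg' and 'class' never share a position.
grouped = and_then_follows(["pg", "class"], "pg", positions)

# 'pg' <-> 'class' <-> 'pg': positions 1, 2, 3 satisfy the chain.
chained = phrase(["pg", "class", "pg"], positions)

print(grouped, chained)
```

Under these assumptions the grouped form cannot match the very text it
was built from, while a plain chain of phrase operators does.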

The bug wasn't introduced by the patch and can be found in current
master. During the discussion of the patch with Dmitry, he noticed that
the to_tsquery() function shares the same odd behavior:
select to_tsquery('english', ' pg_class <-> pg');
to_tsquery
-----------------------------
( 'pg' & 'class' ) <-> 'pg'
(1 row)

This oddity is caused by the implementation of makepol. In makepol,
each token (parsed by the query parser) is sent to the FTS parser, and
if the token is split further, the parts are joined with the operator
selected by the calling function (to_tsquery and friends). So the
operator doesn't change at runtime.
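
A rough Python model of that behavior (a sketch only; the real code is
C in makepol and pushval_morph). Here split_token is a hypothetical
stand-in for the FTS parser, and stemming is ignored:

```python
import re


def split_token(token):
    # Hypothetical stand-in for the FTS parser: split on anything that
    # is not a letter or digit, so 'pg_class' becomes ['pg', 'class'].
    return [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]


def build_query(tokens, query_ops, split_op):
    """Join query tokens with the operators the user wrote (query_ops),
    but join the parts of a split token with one fixed operator
    (split_op) chosen up front by the calling function."""
    operands = []
    for token in tokens:
        parts = [f"'{p}'" for p in split_token(token)]
        joined = f" {split_op} ".join(parts)
        operands.append(f"( {joined} )" if len(parts) > 1 else joined)

    out = operands[0]
    for op, operand in zip(query_ops, operands[1:]):
        out += f" {op} {operand}"
    return out


# Models to_tsquery('english', 'pg_class <-> pg') with '&' as the
# fixed operator for split tokens.
result = build_query(["pg_class", "pg"], ["<->"], "&")
print(result)
```

The point of the sketch is that split_op is fixed for the whole call,
regardless of which operator surrounds the token being split.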

I see two different ways to solve it:
1) Use the same operator inside the parentheses. This would mean
parsing it as a few parts of one word.
2) Remove the parentheses. This would mean parsing it as a few
separate words.

I prefer the second way, since in some languages words can be separated
by some special symbol or not separated by any symbols at all, and
should be extracted by a special FTS parser. It also allows us to parse
such words as one by using the special parser (as is done for
hyphenated words).

But in the example with websearch_to_tsquery, I think it should use
the same operator for the quoted part of the query. For example, we can
update the operator in makepol before passing it to pushval
(pushval_morph) to do so.
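
As an illustration of the idea for the quoted case, again a toy Python
model rather than the actual C code: build_quoted and split_token are
hypothetical names, and the flat rendering without parentheses is an
assumption about how the result might be printed.

```python
import re


def split_token(token):
    # Hypothetical stand-in for the FTS parser (no stemming).
    return [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]


def build_quoted(text, split_op):
    """Build a query string for one quoted phrase. Words are joined
    with <->; parts of a split word are joined with split_op."""
    operands = []
    for token in text.split():
        parts = [f"'{p}'" for p in split_token(token)]
        joined = f" {split_op} ".join(parts)
        # Parentheses only matter when the inner operator differs
        # from the surrounding phrase operator.
        if len(parts) > 1 and split_op != "<->":
            joined = f"( {joined} )"
        operands.append(joined)
    return " <-> ".join(operands)


# Current behavior: the fixed '&' is used even inside quotes.
current = build_quoted("pg_class pg", "&")

# Suggested behavior: inside quotes the operator is switched to <->.
proposed = build_quoted("pg_class pg", "<->")

print(current)
print(proposed)
```

With the operator overridden for quoted input, the split parts form one
chain of phrase operators together with the rest of the phrase.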

It looks like there should be two separate patches: one for
websearch_to_tsquery and another one fixing the odd behavior of
query construction. But since the first one may depend on the
bugfix to handle the case with quotes, I will mark it as Waiting on
Author.

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
