Re: Should phraseto_tsquery('simple', 'blue blue') @@ to_tsvector('simple', 'blue') be true ?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: obartunov(at)gmail(dot)com
Cc: Jean-Pierre Pelletier <jppelletier(at)e-djuster(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: Should phraseto_tsquery('simple', 'blue blue') @@ to_tsvector('simple', 'blue') be true ?
Date: 2016-06-08 21:44:11
Message-ID: 11252.1465422251@sss.pgh.pa.us
Lists: pgsql-hackers

Oleg Bartunov <obartunov(at)gmail(dot)com> writes:
> On Wed, Jun 8, 2016 at 1:05 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I concur that that seems like a rather useless behavior. If we have
>> "x <-> y" it is not possible to match at distance zero, while if we
>> have "x <-> x" it seems unlikely that the user is expecting us to
>> treat that identically to "x". So phrase search simply should not
>> consider distance-zero matches.

> what's about word with several infinitives

> select to_tsvector('en', 'leavings');
> to_tsvector
> ------------------------
> 'leave':1 'leavings':1
> (1 row)

> select to_tsvector('en', 'leavings') @@ 'leave <0> leavings'::tsquery;
> ?column?
> ----------
> t
> (1 row)

Hmm. I can grant that there might be some cases where you want to see
if two separate patterns match the same lexeme, but that seems like an
extremely specialized use-case that you would only invoke very
intentionally. It should not be built in as part of the default behavior
of every phrase search, because 99% of the time this would be an
unexpected and unwanted match. I'm not even convinced that the operator
for this should be spelled <0> --- that seems more like a hack than a
natural extension of phrase search. But if we do spell it like that,
then I think it should be called out as a special case that only applies
to <0>; that is, for any other value of N, the match has to be to separate
lexemes.
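
To make the proposed rule concrete, here is a minimal sketch in Python. It is a toy model, not PostgreSQL's implementation: a tsvector is represented as a hand-written dict from lexeme to positions, and the function name `phrase_match` is hypothetical. It treats <0> as the one special case that matches at the same position, and requires separate lexemes for every other N.

```python
# Toy model only: a "tsvector" is a dict mapping lexeme -> list of positions.
# This is an illustration of the rule described above, not PostgreSQL code.

def phrase_match(doc, left, right, n):
    """Return the positions of `right` that satisfy `left <n> right`.

    n == 0 is the special case: both patterns must match at the same position.
    For n >= 1 the two lexemes must be separate, between 1 and n positions apart.
    """
    left_pos = doc.get(left, [])
    right_pos = doc.get(right, [])
    hits = []
    for r in right_pos:
        for l in left_pos:
            d = r - l
            if (n == 0 and d == 0) or (n >= 1 and 1 <= d <= n):
                hits.append(r)
                break  # one qualifying left position is enough
    return hits

# 'leavings' yields two lexemes at the same position (values from the quoted example).
leavings = {"leave": [1], "leavings": [1]}
# A document containing the single word 'blue'.
blue = {"blue": [1]}

print(phrase_match(leavings, "leave", "leavings", 0))  # [1]
print(phrase_match(blue, "blue", "blue", 1))           # []
```

Under this rule the same-position match is available only to someone who writes <0> on purpose, and 'blue <-> blue' no longer matches a document with a single 'blue'.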

This brings up something else that I am not very sold on: to wit,
do we really want the "less than or equal" distance behavior at all?
The documentation gives the example that
    phraseto_tsquery('cat ate some rats')
produces
    ( 'cat' <-> 'ate' ) <2> 'rat'
because "some" is a stopword. However, that pattern will also match
"cat ate rats", which seems surprising and unexpected to me; certainly
it would surprise a user who did not realize that "some" is a stopword.

So I think there's a reasonable case for decreeing that <N> should only
match lexemes *exactly* N apart. If we did that, we would no longer have
the misbehavior that Jean-Pierre is complaining about, and we'd not need
to argue about whether <0> needs to be treated specially.

Or maybe we need two operators, one for exactly-N-apart and one for
at-most-N-apart.
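
The difference between the two semantics can be sketched with a small Python model. Everything here is an assumption made for illustration: the documents are hand-written lexeme-to-position maps (stopword removal and stemming are taken as already done), and the helper names `follows`, `cat_ate_rat` and `blue_blue` are hypothetical. The `exact` flag stands in for the choice between an exactly-N-apart operator and an at-most-N-apart one.

```python
# Toy model only: documents are lexeme -> positions maps written by hand.
# It illustrates the two candidate semantics; it is not PostgreSQL code.

def follows(left_positions, doc, lexeme, n, exact):
    """Positions of `lexeme` that follow some left position at distance n.

    exact=True  : the distance must equal n
    exact=False : the distance may be anything from 0 to n ("less than or equal")
    """
    hits = []
    for r in doc.get(lexeme, []):
        for l in left_positions:
            d = r - l
            ok = (d == n) if exact else (0 <= d <= n)
            if ok:
                hits.append(r)
                break  # one qualifying left position is enough
    return hits

def cat_ate_rat(doc, exact):
    # Models ( 'cat' <-> 'ate' ) <2> 'rat'
    ate = follows(doc.get("cat", []), doc, "ate", 1, exact)
    return bool(follows(ate, doc, "rat", 2, exact))

def blue_blue(doc, exact):
    # Models 'blue' <-> 'blue'
    return bool(follows(doc.get("blue", []), doc, "blue", 1, exact))

with_stopword = {"cat": [1], "ate": [2], "rat": [4]}     # "cat ate some rats"
without_stopword = {"cat": [1], "ate": [2], "rat": [3]}  # "cat ate rats"
single_blue = {"blue": [1]}                              # "blue"

for exact in (False, True):
    label = "exactly-N" if exact else "at-most-N"
    print(label,
          cat_ate_rat(with_stopword, exact),
          cat_ate_rat(without_stopword, exact),
          blue_blue(single_blue, exact))
# at-most-N True True True
# exactly-N True False False
```

In this model the at-most-N rule accepts both "cat ate rats" and the single 'blue', while the exactly-N rule accepts only the text the query was built from.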

regards, tom lane
