Re: Rethinking the implementation of ts_headline()

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, sebastian(dot)patino-lang(at)posteo(dot)net
Subject: Re: Rethinking the implementation of ts_headline()
Date: 2023-01-19 16:13:30
Message-ID: 4066528.1674144810@sss.pgh.pa.us
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> On 2023-Jan-18, Tom Lane wrote:
>> It's including hits for "day" into the cover despite the lack of any
>> nearby match to "drink".

> I suppose it would be possible to put 'day' and 'drink' in two different
> fragments: since the query has a & operator for them, the words don't
> necessarily have to be nearby. But OK, your argument for this being the
> shortest result is sensible.

> I do wonder, though, if it's effectively usable for somebody building a
> search interface on top. If I'm ranking the results from several
> documents, and this document comes on top of others that only match one
> arm of the OR query, then I would like to be able to show the matches
> for both arms of the OR.

The fundamental problem with the case you're posing is that MaxWords
is too small to allow the 'day & drink' match to be shown as a whole.
If you make MaxWords large enough then you do find it including
(and highlighting) 'drink', but I'm not sure we should stress out
about what happens otherwise.

> Oh, I see the problem, and it is my misunderstanding: the <-> operator
> is counting the words in between, even if they are stop words.

Yeah. AFAICS this is a very deliberate, longstanding decision,
so I'm hesitant to change it. Your test case with 'simple'
proves little, because there are no stop words in 'simple':

regression=# select to_tsvector('simple', 'As idle as a painted Ship');
                 to_tsvector
----------------------------------------------
 'a':4 'as':1,3 'idle':2 'painted':5 'ship':6
(1 row)

However, when I switch to 'english':

regression=# select to_tsvector('english', 'As idle as a painted Ship');
        to_tsvector
----------------------------
 'idl':2 'paint':5 'ship':6
(1 row)

the stop words are gone, but the recorded positions remain the same.
So this is really a matter of how to_tsvector chooses to count word
positions; it's not the fault of the <-> construct in particular.

I'm not convinced that this particular behavior is wrong, anyway.
The user of text search isn't supposed to have to think about
which words are stopwords or not, so I think that it's entirely
sensible for 'idle as a painted' to match 'idle <3> painted'.
Maybe the docs need some adjustment? But in any case that's
material for a different thread.
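
To make the position counting concrete, here is a toy Python model of
the behavior described above. It is only a sketch, not how PostgreSQL
implements it: the stop-word set and the helper names (to_positions,
followed_by) are made up for illustration, and stemming is left out,
so 'idle' stays 'idle' rather than becoming 'idl'.

```python
# Toy model only: this is not PostgreSQL's parser or dictionary code.
# STOP_WORDS is a tiny illustrative subset, and stemming is omitted.
STOP_WORDS = {"as", "a"}


def to_positions(text, stop_words=frozenset()):
    """Map each kept word to its 1-based positions, counting every word."""
    positions = {}
    for pos, word in enumerate(text.lower().split(), start=1):
        if word in stop_words:
            continue  # dropped from the result, but pos has still advanced
        positions.setdefault(word, []).append(pos)
    return positions


def followed_by(positions, left, right, distance):
    """True if some `right` position is exactly `distance` after a `left` one."""
    rights = set(positions.get(right, []))
    return any(p + distance in rights for p in positions.get(left, []))


simple = to_positions("As idle as a painted Ship")
english = to_positions("As idle as a painted Ship", STOP_WORDS)

print(simple)                                      # every word kept
print(english)                                     # stop words gone, positions unchanged
print(followed_by(english, "idle", "painted", 3))  # analogous to 'idle <3> painted'
print(followed_by(english, "idle", "painted", 1))  # analogous to 'idle <-> painted'
```

In this model 'idle' and 'painted' keep positions 2 and 5 whether or
not the stop words are dropped, so the distance-3 check succeeds and
the distance-1 check fails.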

> I again have to question how valuable in practice is a <N> operator
> that's so strict that I have to know exactly how many stop words I want
> there to be in between the phrase search. For some reason, in my mind I
> had it as "at most N words, ignoring stop words", but that's not what it
> is.

Yeah, I recall discussing "up to N words" semantics for this in the
past, but nobody has made that happen.

> Anyway, I don't think this needs to stop your current patch.

Many thanks for looking at it!

regards, tom lane
