Re: Rethinking the implementation of ts_headline()

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, sebastian(dot)patino-lang(at)posteo(dot)net
Subject: Re: Rethinking the implementation of ts_headline()
Date: 2023-01-19 10:16:29
Message-ID: 20230119101629.vjdjktbol273zjpm@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2023-Jan-18, Tom Lane wrote:

> Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:

> > and was surprised that the match for the 'day & drink' arm of the OR
> > disappears from the reported headline.
>
> I'd argue that that's exactly what should happen. It's supposed to
> find as-short-as-possible cover strings that satisfy the query.

OK, that makes sense.

> I don't find anything particularly principled about the old behavior:
>
> > <b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,↵
> > We stuck ... motion, ↵
> > As <b>idle</b> as a <b>painted</b> Ship ↵
> > Upon
>
> It's including hits for "day" into the cover despite the lack of any
> nearby match to "drink".

I suppose it would be possible to put 'day' and 'drink' in two different
fragments: since the query has a & operator for them, the words don't
necessarily have to be nearby. But OK, your argument for this being the
shortest result is sensible.

I do wonder, though, if it's effectively usable for somebody building a
search interface on top. If I'm ranking the results from several
documents, and this document comes on top of others that only match one
arm of the OR query, then I would like to be able to show the matches
for both arms of the OR.

> > Another thing I think might be a regression is the way fragments are
> > selected. Consider what happens if I change the "idle & painted" in the
> > earlier query to "idle <-> painted", and MaxWords is kept low:
>
> Of course, "idle <-> painted" is satisfied nowhere in this text
> (the words are there, but not adjacent).

Oh, I see the problem, and it is my misunderstanding: the <-> operator
is counting the words in between, even if they are stop words. I
understood from the docs that those words were ignored, but that is not
so. I misread the phraseto_tsquery doc as though they were explaining
the <-> operator.

Another experiment shows that the headline becomes "complete" only if I
specify the exact distance in the <N> operator:

SELECT dist, ts_headline('simple', 'As idle as a painted Ship', to_tsquery('simple', format('(idle <%s> painted)', dist)), 'MaxFragments=5, MaxWords=8, MinWords=4') from generate_series(1, 4) dist;
dist │ ts_headline
──────┼──────────────────────────────────────
1 │ As <b>idle</b> as a
2 │ As <b>idle</b> as a
3 │ <b>idle</b> as a <b>painted</b> Ship
4 │ As <b>idle</b> as a
(4 filas)

I again have to question how valuable in practice is a <N> operator
that's so strict that I have to know exactly how many stop words I want
there to be in between the phrase search. For some reason, in my mind I
had it as "at most N words, ignoring stop words", but that's not what it
is.

Anyway, I don't think this needs to stop your current patch.

> So the cover has to run from the last 'day' to the 'drink'. I think
> v15 is deciding that it runs from the first 'day' to the 'drink',
> which while not exactly wrong is not the shortest cover.

Sounds fair.

> The rest of this is just minor variations in what mark_hl_fragments()
> decides to do based on the precise cover string it's given. I don't
> dispute that mark_hl_fragments() could be improved, but this patch
> doesn't touch its algorithm and I think that doing so should be
> material for a different patch.

Understood and agreed.

> > (Both 15 and master highlight 'painted' in the "Upon a painted Ocean"
> > verse, which perhaps they shouldn't do, since it's not preceded by
> > 'idle'.)
>
> Yeah, and 'idle' too. Once it's chosen a string to show, it'll
> highlight all the query words within that string, whether they
> constitute part of the match or not. I can see arguments on both
> sides of doing it that way; it was probably more sensible before
> we had phrase match than it is now. But again, changing that phase
> of the processing is outside the scope of this patch. I'm just
> trying to undo the damage I did to the cover-string-selection phase.

All clear then.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"The eagle never lost so much time, as
when he submitted to learn of the crow." (William Blake)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message shveta malik 2023-01-19 10:22:36 Re: Perform streaming logical transactions by background workers and parallel apply
Previous Message shveta malik 2023-01-19 10:14:08 Re: Perform streaming logical transactions by background workers and parallel apply