Re: Rethinking the implementation of ts_headline()

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, sebastian(dot)patino-lang(at)posteo(dot)net
Subject: Re: Rethinking the implementation of ts_headline()
Date: 2023-01-16 12:23:03
Message-ID: 20230116122303.ddumvskoxnositjy@alvherre.pgsql
Lists: pgsql-hackers

On 2022-Nov-25, Tom Lane wrote:

> After further contemplation of bug #17691 [1], I've concluded that
> what I did in commit c9b0c678d was largely misguided. For one
> thing, the new hlCover() algorithm no longer finds shortest-possible
> cover strings: if your query is "x & y" and the text is like
> "... x ... x ... y ...", then the selected cover string will run
> from the first occurrence of x to the y, whereas the old algorithm
> would have correctly selected "x ... y". For another thing, the
> maximum-cover-length hack that I added in 78e73e875 to band-aid
> over the performance issues of the original c9b0c678d patch means
> that various scenarios no longer work as well as they used to,
> which is the proximate cause of the complaints in bug #17691.

I came across bug #17556, which contains a different test for this, and
I'm not sure that this patch changes things completely for the better.
In that bug report, Alex Malek presents this example:

select ts_headline('baz baz baz ipsum ' || repeat(' foo ',4998) || 'labor',
                   $$'ipsum' & 'labor'$$::tsquery,
                   'StartSel={, StopSel=}, MaxFragments=100, MaxWords=7, MinWords=3'),
       ts_headline('baz baz baz ipsum ' || repeat(' foo ',4999) || 'labor',
                   $$'ipsum' & 'labor'$$::tsquery,
                   'StartSel={, StopSel=}, MaxFragments=100, MaxWords=7, MinWords=3');

which returns, in the current HEAD, the following:

     ts_headline     │ ts_headline
─────────────────────┼─────────────
 {ipsum} ... {labor} │ baz baz baz
(1 row)

That is, once the two query terms are more than about 5000 words
apart, it fails to find a good cover, but below that distance it
returns an acceptable headline. However, after your proposed patch,
we get this:

 ts_headline │ ts_headline
─────────────┼─────────────
 {ipsum}     │ {ipsum}
(1 row)

which is an improvement in the second case, though perhaps not as much
as we would like, and definitely not an improvement in the first case.
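
For readers following along, here is a rough sketch of what "shortest-possible
cover" means for an AND query such as "x & y". It is only an illustration in
Python and not the hlCover() code; the function name shortest_cover and the
max_len parameter are hypothetical, and max_len is merely a stand-in for the
idea of a maximum cover length, not the rule PostgreSQL actually applies.

```python
from typing import Optional, Sequence, Tuple


def shortest_cover(words: Sequence[str], terms: Sequence[str],
                   max_len: Optional[int] = None) -> Optional[Tuple[int, int]]:
    """Return (start, end) word indexes of the shortest span holding every term.

    Illustrative only: not PostgreSQL's hlCover(). max_len is a hypothetical
    cap on the cover length, measured in words.
    """
    needed = set(terms)
    last_seen = {}  # term -> index of its most recent occurrence
    best = None
    for i, word in enumerate(words):
        if word not in needed:
            continue
        last_seen[word] = i
        if len(last_seen) < len(needed):
            continue  # not every term has appeared yet
        # The shortest cover ending at i starts at the oldest "latest occurrence".
        start = min(last_seen.values())
        if best is None or i - start < best[1] - best[0]:
            best = (start, i)
    if best is not None and max_len is not None and best[1] - best[0] + 1 > max_len:
        return None  # even the shortest cover exceeds the cap
    return best


words = "a x b x c y d".split()
print(shortest_cover(words, ["x", "y"]))             # (3, 5)
print(shortest_cover(words, ["x", "y"], max_len=2))  # None
```

With the text "a x b x c y d" the sketch picks the span from the second x to
the y rather than starting at the first x, and it returns nothing once the
terms are farther apart than the cap.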

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"If you have nothing to say, maybe you need just the right tool to help you
not say it." (New York Times, about Microsoft PowerPoint)
