From: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org, sebastian(dot)patino-lang(at)posteo(dot)net |
Subject: | Re: Rethinking the implementation of ts_headline() |
Date: | 2023-01-16 12:23:03 |
Message-ID: | 20230116122303.ddumvskoxnositjy@alvherre.pgsql |
Lists: | pgsql-hackers |
On 2022-Nov-25, Tom Lane wrote:
> After further contemplation of bug #17691 [1], I've concluded that
> what I did in commit c9b0c678d was largely misguided. For one
> thing, the new hlCover() algorithm no longer finds shortest-possible
> cover strings: if your query is "x & y" and the text is like
> "... x ... x ... y ...", then the selected cover string will run
> from the first occurrence of x to the y, whereas the old algorithm
> would have correctly selected "x ... y". For another thing, the
> maximum-cover-length hack that I added in 78e73e875 to band-aid
> over the performance issues of the original c9b0c678d patch means
> that various scenarios no longer work as well as they used to,
> which is the proximate cause of the complaints in bug #17691.
I came across bug #17556, which contains a different test for this, and
I'm not sure that this patch changes things completely for the better.
In that bug report, Alex Malek presents this example:
select ts_headline('baz baz baz ipsum ' || repeat(' foo ',4998) || 'labor',
                   $$'ipsum' & 'labor'$$::tsquery,
                   'StartSel={, StopSel=}, MaxFragments=100, MaxWords=7, MinWords=3'),
       ts_headline('baz baz baz ipsum ' || repeat(' foo ',4999) || 'labor',
                   $$'ipsum' & 'labor'$$::tsquery,
                   'StartSel={, StopSel=}, MaxFragments=100, MaxWords=7, MinWords=3');
which returns, in the current HEAD, the following:
     ts_headline     │ ts_headline
─────────────────────┼─────────────
 {ipsum} ... {labor} │ baz baz baz
(1 row)
That is, once past a distance of 5000 words, it fails to find a good
cover, but before that it returns an acceptable headline. However,
after your proposed patch, we get this:
 ts_headline │ ts_headline
─────────────┼─────────────
 {ipsum}     │ {ipsum}
(1 row)
which is an improvement in the second case, though perhaps not as much
as we would like, and definitely not an improvement in the first case.
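
For reference, the "shortest-possible cover" property mentioned in the
quoted text can be illustrated with a generic sliding-window sketch. This
is not the hlCover() code; it assumes a query that is a plain AND of
terms and a text that is already split into words, and the name
shortest_cover is hypothetical:

```python
def shortest_cover(words, required):
    """Return (start, end) indices of the shortest window of `words`
    that contains every term in `required`, or None if there is none.

    Illustrative only: handles a plain AND of terms, nothing else.
    """
    required = set(required)
    if not required:
        return None
    counts = {}   # occurrences of each required term inside the window
    best = None
    left = 0
    for right, word in enumerate(words):
        if word in required:
            counts[word] = counts.get(word, 0) + 1
        # Shrink from the left while the window still covers every term.
        while len(counts) == len(required):
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            w = words[left]
            if w in required:
                counts[w] -= 1
                if counts[w] == 0:
                    del counts[w]
            left += 1
    return best


words = "a x b x c y d".split()
start, end = shortest_cover(words, {"x", "y"})
print(words[start:end + 1])   # the window starting at the second x
```

With a text of the form "... x ... x ... y ...", the window starts at
the second x rather than the first, which is the behavior the quoted
text attributes to the old algorithm.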
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"If you have nothing to say, maybe you need just the right tool to help you
not say it." (New York Times, about Microsoft PowerPoint)