Rethinking the implementation of ts_headline()

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: sebastian(dot)patino-lang(at)posteo(dot)net
Subject: Rethinking the implementation of ts_headline()
Date: 2022-11-25 19:52:15
Message-ID: 840.1669405935@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

After further contemplation of bug #17691 [1], I've concluded that
what I did in commit c9b0c678d was largely misguided. For one
thing, the new hlCover() algorithm no longer finds shortest-possible
cover strings: if your query is "x & y" and the text is like
"... x ... x ... y ...", then the selected cover string will run
from the first occurrence of x to the y, whereas the old algorithm
would have correctly selected "x ... y". For another thing, the
maximum-cover-length hack that I added in 78e73e875 to band-aid
over the performance issues of the original c9b0c678d patch means
that various scenarios no longer work as well as they used to,
which is the proximate cause of the complaints in bug #17691.

What I'm now thinking is that the original hlCover() algorithm was
fine (if underdocumented) *as long as it's only asked to deal with
AND-like semantics*. Given that restriction, its approach of
finding the latest first occurrence of any query term, and then
backing up to the earliest last occurrence, is visibly correct;
and we weren't hearing of any performance issues with it either.
The problem is that this approach fails miserably for tsqueries
that aren't pure ANDs, because it will insist on including query
terms that aren't required for a match and indeed could prevent a
match.

So what we need is to find a way to fold the other sorts of queries
into an AND context. After a couple of false starts, I came up
with the attached patch. It builds on the existing TS_phrase_execute
infrastructure, which already produces an ExecPhraseData structure
containing an exact list of the match locations for a phrase subquery.
It's easy to get an ExecPhraseData structure for a single-lexeme
match, too. We can handle plain ANDs by forming lists of
ExecPhraseData structs (with implicit AND semantics across the list).
We can handle ORs by union'ing the component ExecPhraseData structs.
And we can handle NOTs by just dropping them, because we don't want
ts_headline to prioritize matches to such words. There are some fine
points but it all seems to work pretty well.

Hence I propose the attached patchset. 0001 adds some test cases that
I felt necessary after examining code coverage reports; I split this
out mainly so that 0002 clearly shows the behavioral changes from
current code. 0002 adds TS_execute_locations() which does what I just
described, and rewrites hlCover() into something that's a spiritual
descendant of the original algorithm. It can't be quite the same
as before because the location data that TS_execute_locations() returns
is measured in lexeme indexes not word indexes, so some translation is
needed.

Notable here is that a couple of regression test results revert
to what they were before c9b0c678d. They're both instances of
preferring "x ... x ... y" to "x ... y", which I argued was okay
at the time, but I now see the error in that.

Although we back-patched c9b0c678d, I'm inclined to propose this
for HEAD only. The misbehaviors it's fixing are less bad than what
we were dealing with in that patch.

BTW, while experimenting with this I realized that tsvector's limitation
of lexeme indexes to be at most 16383 is really quite disastrous for
phrase searches. That limitation was arguably okay before we had phrase
searching, but now it seems untenable. For example, tsvector entries like
"foo:16383 bar:16383" will not match "foo <-> bar" because the phrase
match code wants their lexeme positions to differ by one. This basically
means that phrase searches do not work beyond ~20K words into a document.
I'm not sure what to do about that exactly, but I think we need to do
something.

Whatever we might do about that would be largely orthogonal to this
patch, in any case. I'll stick this in the January CF.

regards, tom lane

[1] https://www.postgresql.org/message-id/flat/17691-93cef39a14663963%40postgresql.org

Attachment Content-Type Size
0001-add-ts_headline-test-cases.patch text/x-diff 7.6 KB
0002-revise-ts_headline-implementation.patch text/x-diff 22.3 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2022-11-25 19:58:19 Re: Fix for visibility check on 14.5 fails on tpcc with high concurrency
Previous Message Pavel Borisov 2022-11-25 18:52:54 Re: Lockless queue of waiters in LWLock