Re: BUG #16345: ts_headline does not find phrase matches correctly

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: doitsimplefy(at)gmail(dot)com
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16345: ts_headline does not find phrase matches correctly
Date: 2020-04-09 03:02:21
Message-ID: 25072.1586401341@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

PG Bug reporting form <noreply(at)postgresql(dot)org> writes:
> When query:
> select ts_headline(
> $$Lorem ipsum urna. Nullam nullam ullamcorper urna.$$,
> to_tsquery('Lorem') && phraseto_tsquery('ullamcorper urna'),
> 'StartSel=#$#, StopSel=#$#, FragmentDelimiter=$#$, MaxFragments=100,
> MaxWords=100, MinWords=1'
> );
> is ran, a fragment of
>> Lorem ipsum urna. Nullam nullam ullamcorper urna.
> should be returned, however, the result is a single word of #$#Lorem#$# is
> returned, meaning that ts_headline did not find the queried string.

Yeah. I spent some time digging into this, and was reminded of how
miserably baroque and undercommented almost all of the text-search code
is. Anyway, the bottom line is that wparser_def.c's hlCover() is
failing here. It is looking for a minimal cover, that is a substring
satisfying the given tsquery, and obviously with this query the only
such substring is the whole sentence. (Well, not the trailing period.)
However, it looks like hlCover() wasn't updated when we added phrase
search, because that made word position significant and invalidated
a rather fundamental assumption that hlCover() seems to be making:
it figures that any substring including all words of the query ought
to be good enough. Since "urna" also appears as the third word of the
text, hlCover() doesn't try any substrings longer than "Lorem ipsum
urna. Nullam nullam ullamcorper", thus it never finds one that
actually satisfies the query, and we end up failing.

Although utterly undocumented, the algorithm it is using seems to be
"find the latest first occurrence of any query word (here,
'ullamcorper'), then find the earliest last occurrence of any query
word before that (hence, 'Lorem'), and then see if the substring
between those points satisfies the query (oops, nope). If not,
start over from a point one past the previous try." But all the
tries after that will omit 'Lorem', so they all fail to match the
query, even though it'll eventually try substrings that include
the later occurrence of 'urna'.

I've not spent a huge amount of time thinking about it, but this might
be all right as a way to find a shortest-possible cover for queries
involving only AND and OR operators. (It'd fall down on NOT operators
of course, but it already cheats on that by telling TS_execute to
ignore NOT subtrees.) However, it's blatantly wrong as soon as a
phrase operator is involved, because then only some occurrences of a
particular word in the string might meet the query's requirements.

So I set out to rewrite hlCover to make fewer assumptions about
what a valid cover could be. In the new version appearing below,
it just tries every substring that begins and ends with some query
word, preferring earlier and shorter substrings. This should
certainly find the desired cover ... but it didn't work, plus it
broke a number of existing regression test cases. I was thus forced
to the realization that its immediate caller mark_hl_words is *also*
broken, because it's rejecting good headlines in favor of bad ones,
or even in favor of headlines that contain no cover at all. Which
was a bit daunting, because that's an even larger and uglier chunk of
undocumented code.

I ended up with the following stepwise approach to improving the
situation.

0001 below adds the problematic test case, with the wrong output
that HEAD produces. This was basically just to track which changes
affected that.

0002 makes a bunch of purely-cosmetic changes to mark_hl_words
and its siblings, in hopes of making it less unintelligible to
the next hacker. I added comments, used macros to make some of
the hairier if-tests more readable, and changed a couple of
small things for more readability (though they can be proven
not to affect the behavior). As expected, regression outputs
don't change here.

0003 fixes a couple of fairly obvious bugs. One is that there's
an early-exit optimization that tries to reject a possible headline
before having fully defined its boundaries. This is not really
necessary, but worse it's wrong because the figure of merit might
change by the time we've chosen the actual boundaries. Deleting
it doesn't change any regression test cases, but I'm sure it'd be
possible to invent a scenario where it does the wrong thing.
The other bug is that the loop that tries to shorten a maximum-length
headline until it has a good endpoint has an order-of-operations issue:
it can exit after making adjustments to curlen and poslen that discount
the i'th word from the headline, but without changing pose to actually
exclude that word. So we end up with a poslen figure-of-merit that
does not describe the actual headline from posb to pose.

Unfortunately, the one regression test output change caused by 0003
is clearly for the worse: hlCover successfully finds the cover '1 3'
for the query, but now mark_hl_words discards '3' and only highlights
'1'. What is happening is that '3' fails the "short word" test and
is thereby excluded from the headline. This behavior is clearly what
the code intends, but it was accidentally masked before by the
order-of-operations bug.

I argue that the problem here is that excluding an actual query term
from the headline on the basis of its being short is just stupid.
There is much-earlier processing that is charged with excluding
stop words from tsqueries, and this code has no business second-
guessing that. So the short-word test should only be applied to
text words that did not appear in the tsquery.

Hence, 0004 rejiggers the new BADENDPOINT() macro so that query
terms are never considered bad endpoints. That fixes the '1 <-> 3'
test case broken by 0003, but now there's a new diff: matching
'1 2 3 1 3' to '1 & 3' now selects only '1 2 3' as the headline
not '1 2 3 1'. I do not think this is a bug though. The reason
we got '1 2 3 1' before is that hlCover selected the cover '1 2 3'
but then mark_hl_words decided '3' was a bad endpoint and extended
the headline by one word. (It would've extended more, because '1'
is also a bad endpoint, except MaxWords=4 stopped it.) Again,
treating '3' as a bad endpoint is just silly, so I think this change
is acceptable.

Next, 0005 rearranges the preference order for different possible
headlines so that the first preference item is whether or not the
headline includes the cover string initially found by hlCover.
(It might not, if the cover string was longer than MaxWords.)
It seems to me to be dumb to prefer headlines that don't meet that
requirement to ones that do, because shorter headlines might not
satisfy the user's query, which surely fails to satisfy the principle
of least astonishment. (While I'm not entirely sure what can be
said about the old implementation of hlCover, with my rewrite it
is *certain* that substrings not including the full cover won't
satisfy the query.) 0005 also gets rid of what seems to me to
be a corner-case bug in the old preference logic, which is that
it will take a headline with fewer query words over one with more,
if the former has a "good" endpoint and the latter doesn't. That
makes the actual preference order close to unexplainable.

0005 doesn't in itself change any regression results, but it's
necessary to prevent problems from appearing with the next patch.
The real situation here, as I've come to understand it, is that
the existing hlCover code frequently produces only one possible cover
and thus it doesn't matter how silly are mark_hl_words's rules for
preferring one over another. The rewrite causes hlCover to produce
more candidate covers and so it becomes more important for
mark_hl_words to make sane decisions.

Lastly, 0006 introduces the new hlCover code. This at last fixes
the test case for the bug at hand. It also introduces two new diffs.
One is this change in one Rime of the Ancient Mariner example:

- <b>painted</b> <b>Ocean</b>. +
- Water, water, every where +
- And all the boards did shrink;+
- Water, water, every

+ <b>painted</b> Ship +
+ Upon a <b>painted</b> <b>Ocean</b>.+
+ Water, water, every where +
+ And all the boards did shrink

I don't see any way that that's not an improved match, given that
the query is 'painted <-> Ocean'; including another match to one
of the query words is surely better than not doing so. The other
change is that matching '1 2 3 1 3' to '1 <-> 3' now selects '3 1 3'
not just '1 3'. These are basically both the same change. The
reason for it is that the old hlCover would *only* find the cover
'painted Ocean' (or '1 3'). The new hlCover finds that, but it
also finds 'painted ... painted Ocean' (or '3 1 3'), and then the
preference metric for more query words likes this option better.

So my feeling is that these changes are for the better and we shouldn't
complain. We could perhaps make them go away if we changed the
preference rules some more, for example by preferring headlines that
use shorter covers instead of (or at least ahead of) those having more
query words. But ISTM that would actually be a bigger change from the
current behavior, so likely it would create new changes in other query
results. Besides, there's already a preference for shorter covers in
hlCover, so I don't feel like we need another one at the calling level.

In short then, I propose applying 0001-0006. I'm not quite sure
if we should back-patch, or just be content to fix this in HEAD.
But there's definitely an argument that this has been broken since
we added phrase search (in 9.6) and deserves to be back-patched.

(BTW, I wonder if we could now undo the hack to ignore NOT
restrictions while finding covers. I haven't tried it though.)

regards, tom lane

Attachment Content-Type Size
0001-add-test-case-with-wrong-answer.patch text/x-diff 2.2 KB
0002-headline-cosmetic-improvements.patch text/x-diff 8.7 KB
0003-headline-minor-bug-fixes.patch text/x-diff 1.5 KB
0004-redefine-bad-endpoint.patch text/x-diff 1.5 KB
0005-consider-whether-cover-is-included.patch text/x-diff 1.7 KB
0006-rewrite-hlCover.patch text/x-diff 6.0 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2020-04-09 06:00:39 BUG #16353: pg_isready timeout is ignored
Previous Message Kyotaro Horiguchi 2020-04-09 02:26:57 Re: [BUG] non archived WAL removed during production crash recovery

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexandra Wang 2020-04-09 03:17:55 Re: Report error position in partition bound check
Previous Message Kyotaro Horiguchi 2020-04-09 02:26:57 Re: [BUG] non archived WAL removed during production crash recovery