Re: Rethinking the implementation of ts_headline()

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, sebastian(dot)patino-lang(at)posteo(dot)net
Subject: Re: Rethinking the implementation of ts_headline()
Date: 2023-01-18 22:55:46
Message-ID: 3951261.1674082546@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> I tried this other test, based on looking at the new regression tests
> you added,

> SELECT ts_headline('english', '
> Day after day, day after day,
> We stuck, nor breath nor motion,
> As idle as a painted Ship
> Upon a painted Ocean.
> Water, water, every where
> And all the boards did shrink;
> Water, water, every where,
> Nor any drop to drink.
> S. T. Coleridge (1772-1834)
> ', to_tsquery('english', '(day & drink) | (idle & painted)'), 'MaxFragments=5, MaxWords=9, MinWords=4');
> ts_headline
> ─────────────────────────────────────────
> motion, ↵
> As <b>idle</b> as a <b>painted</b> Ship↵
> Upon
> (1 fila)

> and was surprised that the match for the 'day & drink' arm of the OR
> disappears from the reported headline.

I'd argue that that's exactly what should happen. It's supposed to
find as-short-as-possible cover strings that satisfy the query.
In this case, satisfying the 'day & drink' condition would require
nearly the entire input text, whereas 'idle & painted' can be
satisfied in just the third line. So what you get is the third line,
slightly expanded because of some later rules that like to add
context if the cover is shorter than MaxWords. I don't find anything
particularly principled about the old behavior:

> <b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,↵
> We stuck ... motion, ↵
> As <b>idle</b> as a <b>painted</b> Ship ↵
> Upon

It's including hits for "day" into the cover despite the lack of any
nearby match to "drink".

> Another thing I think might be a regression is the way fragments are
> selected. Consider what happens if I change the "idle & painted" in the
> earlier query to "idle <-> painted", and MaxWords is kept low:

Of course, "idle <-> painted" is satisfied nowhere in this text
(the words are there, but not adjacent). So the cover has to
run from the last 'day' to the 'drink'. I think v15 is deciding
that it runs from the first 'day' to the 'drink', which while not
exactly wrong is not the shortest cover. The rest of this is just
minor variations in what mark_hl_fragments() decides to do based
on the precise cover string it's given. I don't dispute that
mark_hl_fragments() could be improved, but this patch doesn't touch
its algorithm and I think that doing so should be material for a
different patch. (I have no immediate ideas about what would be
a better algorithm for it, anyway.)

> (Both 15 and master highlight 'painted' in the "Upon a painted Ocean"
> verse, which perhaps they shouldn't do, since it's not preceded by
> 'idle'.)

Yeah, and 'idle' too. Once it's chosen a string to show, it'll
highlight all the query words within that string, whether they
constitute part of the match or not. I can see arguments on both
sides of doing it that way; it was probably more sensible before
we had phrase match than it is now. But again, changing that phase
of the processing is outside the scope of this patch. I'm just
trying to undo the damage I did to the cover-string-selection phase.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2023-01-18 23:17:49 Re: CREATEROLE users vs. role properties
Previous Message Peter Geoghegan 2023-01-18 22:37:20 Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation