Re: Rethinking the implementation of ts_headline()

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, sebastian(dot)patino-lang(at)posteo(dot)net
Subject: Re: Rethinking the implementation of ts_headline()
Date: 2023-01-18 11:09:42
Message-ID: 20230118110942.od2naagwp6molgxz@alvherre.pgsql
Lists: pgsql-hackers

I tried another test, based on the new regression tests you added:

SELECT ts_headline('english', '
Day after day, day after day,
We stuck, nor breath nor motion,
As idle as a painted Ship
Upon a painted Ocean.
Water, water, every where
And all the boards did shrink;
Water, water, every where,
Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle & painted)'), 'MaxFragments=5, MaxWords=9, MinWords=4');
ts_headline
─────────────────────────────────────────
motion, ↵
As <b>idle</b> as a <b>painted</b> Ship↵
Upon
(1 row)

and was surprised that the match for the 'day & drink' arm of the OR
disappears from the reported headline.

This is what 15 reports for the same query:

SELECT ts_headline('english', '
Day after day, day after day,
We stuck, nor breath nor motion,
As idle as a painted Ship
Upon a painted Ocean.
Water, water, every where
And all the boards did shrink;
Water, water, every where,
Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle & painted)'), 'MaxFragments=5, MaxWords=9, MinWords=4');
ts_headline
───────────────────────────────────────────────────────────
<b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,↵
We stuck ... motion, ↵
As <b>idle</b> as a <b>painted</b> Ship ↵
Upon
(1 row)

I think this was better.
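
To see which arms of the OR match the text on their own, each arm can be
tested separately with the @@ operator. This is only a sketch: the
shortened sample text and the column aliases are assumptions made for
brevity, and no output is shown.

```sql
-- Sketch: test each arm of the OR against the text separately.
-- The text is abbreviated to the verses that contain the query words.
WITH doc(body) AS (
    VALUES ('Day after day, day after day,
As idle as a painted Ship
Upon a painted Ocean.
Nor any drop to drink.')
)
SELECT to_tsvector('english', body) @@ to_tsquery('english', 'day & drink')    AS day_drink_matches,
       to_tsvector('english', body) @@ to_tsquery('english', 'idle & painted') AS idle_painted_matches
FROM doc;
```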

15 seems to fail in other ways, though. For instance, 'drink' is not
highlighted in the headline when the other arm of the OR also matches, but it
is when that arm doesn't match. Both 15 and master return the same result for
this one:

SELECT ts_headline('english', '
Day after day, day after day,
We stuck, nor breath nor motion,
As idle as a painted Ship
Upon a painted Ocean.
Water, water, every where
And all the boards did shrink;
Water, water, every where,
Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (mountain & backpack)'), 'MaxFragments=5, MaxWords=9, MinWords=4');
ts_headline
───────────────────────────────────────────────────────────
<b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,↵
We stuck ... drop to <b>drink</b>. ↵
S. T. Coleridge
(1 row)

Another thing I think might be a regression is the way fragments are
selected. Consider what happens if I change the "idle & painted" in the
earlier query to "idle <-> painted", and MaxWords is kept low:

SELECT ts_headline('english', '
Day after day, day after day,
We stuck, nor breath nor motion,
As idle as a painted Ship
Upon a painted Ocean.
Water, water, every where
And all the boards did shrink;
Water, water, every where,
Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle <-> painted)'), 'MaxFragments=5, MaxWords=9, MinWords=4');
ts_headline
───────────────────────────────────────────────
<b>day</b>, ↵
We stuck, nor breath nor motion, ↵
As <b>idle</b> ... <b>painted</b> Ship ↵
Upon a <b>painted</b> Ocean. ↵
Water, water, every ... drop to <b>drink</b>.↵
S. T. Coleridge
(1 row)

Note that it chose to put a fragment delimiter exactly in the middle of the
phrase match, where the stop words are. If I raise MaxWords, the result is of
course much better, I suppose because the word limit no longer forces a new
fragment:

SELECT ts_headline('english', '
Day after day, day after day,
We stuck, nor breath nor motion,
As idle as a painted Ship
Upon a painted Ocean.
Water, water, every where
And all the boards did shrink;
Water, water, every where,
Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle <-> painted)'), 'MaxFragments=5, MaxWords=25, MinWords=4');
ts_headline
──────────────────────────────────────────────────
after <b>day</b>, <b>day</b> after <b>day</b>, ↵
We stuck, nor breath nor motion, ↵
As <b>idle</b> as a <b>painted</b> Ship ↵
Upon a <b>painted</b> Ocean. ↵
Water, water, every where ... boards did shrink;↵
Water, water, every where, ↵
Nor any drop to <b>drink</b>. ↵
S. T. Coleridge
(1 row)

But in 15, the query with low MaxWords produces this instead, with the
fragment delimiter placed just *before* the phrase match:

SELECT ts_headline('english', '
Day after day, day after day,
We stuck, nor breath nor motion,
As idle as a painted Ship
Upon a painted Ocean.
Water, water, every where
And all the boards did shrink;
Water, water, every where,
Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle <-> painted)'), 'MaxFragments=5, MaxWords=9, MinWords=4');
ts_headline
───────────────────────────────────────────────────────────
<b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,↵
We stuck ... <b>idle</b> as a <b>painted</b> Ship ↵
Upon a <b>painted</b> Ocean ... drop to <b>drink</b>. ↵
S. T. Coleridge
(1 row)

(Both 15 and master highlight 'painted' in the "Upon a painted Ocean"
verse, which perhaps they shouldn't do, since it's not preceded by
'idle'.)
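
Whether either occurrence of 'painted' is a genuine phrase match can be
checked outside ts_headline. In a tsvector, stop words are dropped but
still consume positions, so it may be worth looking at the positions of
'idle' and 'painted' and testing the phrase operator directly. A sketch,
with the two verses joined on one line as the only input, an illustrative
column alias, and no output shown:

```sql
-- Show the lexemes and positions the english configuration produces.
SELECT to_tsvector('english', 'As idle as a painted Ship Upon a painted Ocean.');

-- Test the phrase operator against the same text.
SELECT to_tsvector('english', 'As idle as a painted Ship Upon a painted Ocean.')
       @@ to_tsquery('english', 'idle <-> painted') AS adjacent_phrase_matches;

-- For comparison, phraseto_tsquery builds the phrase query from the original
-- wording, accounting for any stop words between the two words.
SELECT phraseto_tsquery('english', 'idle as a painted');
```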

(I think it's super annoying that the fragment separation algorithm
fails to preserve newlines between verses as it adds the '...'
separator. But I guess poetry is not the main use case for text search
anyway, so it probably doesn't matter much.)
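
ts_headline does let the separator be changed through its
FragmentDelimiter option. The sketch below only swaps in a more visible
marker so that fragment boundaries stand out in the output; the marker
string is an arbitrary choice, and this example does not show whether any
delimiter setting would preserve the verse breaks.

```sql
-- Same query as above, with a custom fragment separator.
-- String option values containing spaces must be double-quoted.
SELECT ts_headline('english', '
Day after day, day after day,
We stuck, nor breath nor motion,
As idle as a painted Ship
Upon a painted Ocean.
Water, water, every where
And all the boards did shrink;
Water, water, every where,
Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle <-> painted)'),
   'MaxFragments=5, MaxWords=9, MinWords=4, FragmentDelimiter=" [...] "');
```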

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Every machine is a smoke machine if you operate it wrong enough."
https://twitter.com/libseybieda/status/1541673325781196801
