Re: BUG #17556: ts_headline does not correctly find matches when separated by 4,999 words

From: Alex Malek <magicagent(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17556: ts_headline does not correctly find matches when separated by 4,999 words
Date: 2022-07-25 13:31:45
Message-ID: CAGH8cceNS=J3OJMv9y_D009hnFhZtU4YbBwp3OxYhn8TA=i0VQ@mail.gmail.com
Lists: pgsql-bugs

On Sun, Jul 24, 2022 at 10:36 PM Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
wrote:

> At Fri, 22 Jul 2022 14:06:43 +0000, PG Bug reporting form <
> noreply(at)postgresql(dot)org> wrote in
> > The following bug has been logged on the website:
> >
> > Bug reference: 17556
> > Logged by: Alex Malek
> > Email address: magicagent(at)gmail(dot)com
> > PostgreSQL version: 14.4
> > Operating system: Red Hat
> > Description:
> >
> > Correct results when 4,998 words separate search terms:
> >
> > # select ts_headline('baz baz baz ipsum ' || repeat(' foo ',4998) || ' labor',
> > $$'ipsum' & 'labor'$$::tsquery,
> > 'StartSel=>, StopSel=<, MaxFragments=100, MaxWords=7, MinWords=3') ;
> > ts_headline
> > ---------------------
> > >ipsum< ... >labor<
> > (1 row)
> >
> > Add one more word between terms being searched for, to total 4,999, and
> > terms are not found:
> >
> > # select ts_headline('baz baz baz ipsum ' || repeat(' foo ',4999) || ' labor',
> > $$'ipsum' & 'labor'$$::tsquery,
> > 'StartSel=>, StopSel=<, MaxFragments=100, MaxWords=7, MinWords=3') ;
> > ts_headline
> > -------------
> > baz baz baz
> > (1 row)
>
> When ts_headline searches the document, it splits the document into
> segments whose length is called max_cover internally, which is not
> configurable for now [1]. In the latter case above, it is
> MaxFragments * (max(MaxWords * 10, 100)) = 10000 "words", where
> whitespace is counted as words. The document has 10007 "words",
> where 'ipsum' is the 7th word and 'labor' is the 10007th word. The two
> words aren't within a 10000-word segment, so the match is missed.
> ts_headline instead returns the first MinWords words, as you see.
>
> This is not a bug, but the designed behavior. However, we might want to
> document that behavior.
>
> This could be "improved" as [1], but in this specific case, I doubt
> the usefulness of ts_headline picking it up when the two words are
> that far apart, in exchange for possible degradation.
>
>
> [1] For developers, wparser_def.c:2582
> > * We might eventually make max_cover a user-settable parameter, but for
> > * now, just compute a reasonable value based on max_words and
> > * max_fragments.
>
>
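To check that I follow the arithmetic above, here is a rough Python sketch of it. The function names are mine, the tokenization (each run of whitespace or non-whitespace counts as one "word") is only my approximation of the counting described above, and the "<=" comparison is an assumption; none of this is taken from wparser_def.c.

```python
import re

def max_cover(max_fragments: int, max_words: int) -> int:
    # Formula as quoted above: MaxFragments * max(MaxWords * 10, 100)
    return max_fragments * max(max_words * 10, 100)

def token_position(doc: str, word: str) -> int:
    # Assumption: every run of whitespace and every run of non-whitespace
    # counts as one "word"; the real parser is more involved than this.
    tokens = re.findall(r"\s+|\S+", doc)
    return tokens.index(word) + 1  # 1-based position of the first occurrence

def span(doc: str, first: str, last: str) -> int:
    # Number of "words" from `first` through `last`, inclusive
    return token_position(doc, last) - token_position(doc, first) + 1

def make_doc(n: int) -> str:
    # Same shape as the documents in the bug report
    return "baz baz baz ipsum " + " foo " * n + " labor"

limit = max_cover(max_fragments=100, max_words=7)
for n in (4998, 4999):
    doc = make_doc(n)
    # Assumed rule: both terms must fit inside one segment of `limit` "words"
    print(n, token_position(doc, "labor"), span(doc, "ipsum", "labor") <= limit)
```

By this count 'labor' lands at position 10005 with 4,998 repetitions and at 10007 with 4,999, so only the first case fits within the 10000-word limit, which matches the two results in the report.
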
Since the expected output is produced for much larger documents when OR
('|') replaces AND ('&'),
what if the code, when no match is found, tries again with such a
replacement?
Alternatively, since the "highlighting" of terms is the same for '|' vs '&',
maybe always do the replacement?

Note: I have no idea how the parsing, max_cover, etc. actually work; I am
suggesting "high level" ideas
that I realize may or may not make sense for that code base.
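
To make the retry idea concrete, a caller-side sketch in Python. Everything here is hypothetical: `headline` stands in for whatever actually runs ts_headline, and detecting "no match" by the absence of StartSel in the result is my assumption, not something the server reports.

```python
from typing import Callable

def and_to_or(query: str) -> str:
    # Naive textual swap; only meant for flat queries like 'a' & 'b'.
    # It would misbehave if a quoted lexeme itself contained '&'.
    return query.replace("&", "|")

def headline_with_fallback(
    headline: Callable[[str, str], str],  # hypothetical wrapper around ts_headline
    doc: str,
    query: str,
    start_sel: str = ">",
) -> str:
    result = headline(doc, query)
    # Assumption: "no match found" shows up as a result with no StartSel marker
    if start_sel not in result and "&" in query:
        result = headline(doc, and_to_or(query))
    return result
```

The second variant I mentioned (always replace) would simply call `headline(doc, and_to_or(query))` unconditionally.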

Correct highlighting for 100,000+ "words" using OR ('|'):

# select ts_headline('baz baz baz ipsum ' || repeat(' foo ',100000) || ' labor',
$$'ipsum' | 'labor'$$::tsquery,
'StartSel=>, StopSel=<, MaxFragments=100, MaxWords=7, MinWords=3') ;
ts_headline
---------------------
>ipsum< ... >labor<
(1 row)

Highlighting is the same for OR vs AND:

# select ts_headline('baz baz baz ipsum labor foo foo foo',
$$'ipsum' & 'labor'$$::tsquery, 'StartSel=>, StopSel=<');
ts_headline
-----------------------------------------
baz baz baz >ipsum< >labor< foo foo foo
(1 row)

# select ts_headline('baz baz baz ipsum labor foo foo foo',
$$'ipsum' | 'labor'$$::tsquery, 'StartSel=>, StopSel=<');
ts_headline
-----------------------------------------
baz baz baz >ipsum< >labor< foo foo foo
(1 row)
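
The reason I would expect the two outputs above to match can be shown with a toy model in Python. It is not how ts_headline works internally; it only assumes that marking depends on which lexemes appear in the query and not on the operator joining them. The helper names are made up for illustration.

```python
import re

def query_terms(query: str) -> set:
    # Pull the quoted lexemes out of a query such as 'ipsum' & 'labor'
    return set(re.findall(r"'([^']+)'", query))

def highlight(doc: str, query: str, start_sel: str = ">", stop_sel: str = "<") -> str:
    # Toy rule: wrap every word that is one of the query's lexemes,
    # ignoring whether the lexemes are joined by '&' or '|'
    terms = query_terms(query)
    return " ".join(
        f"{start_sel}{word}{stop_sel}" if word in terms else word
        for word in doc.split()
    )

doc = "baz baz baz ipsum labor foo foo foo"
print(highlight(doc, "'ipsum' & 'labor'"))
print(highlight(doc, "'ipsum' | 'labor'"))
```

Both calls print the same line, since the operator never enters the marking step in this model.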

Best,
Alex
