Re: BUG #17556: ts_headline does not correctly find matches when separated by 4,999 words

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: magicagent(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17556: ts_headline does not correctly find matches when separated by 4,999 words
Date: 2022-07-25 02:36:08
Message-ID: 20220725.113608.1175924917662229386.horikyota.ntt@gmail.com
Lists: pgsql-bugs

At Fri, 22 Jul 2022 14:06:43 +0000, PG Bug reporting form <noreply(at)postgresql(dot)org> wrote in
> The following bug has been logged on the website:
>
> Bug reference: 17556
> Logged by: Alex Malek
> Email address: magicagent(at)gmail(dot)com
> PostgreSQL version: 14.4
> Operating system: Red Hat
> Description:
>
> Correct results when 4,998 words separate search terms:
>
> # select ts_headline('baz baz baz ipsum ' || repeat(' foo ',4998) || '
> labor',
> $$'ipsum' & 'labor'$$::tsquery, 'StartSel=>, StopSel=<,
> MaxFragments=100, MaxWords=7, MinWords=3') ;
> ts_headline
> ---------------------
> >ipsum< ... >labor<
> (1 row)
>
> Add one more word between terms being searched for, to total 4,999, and
> terms are not found:
>
> # select ts_headline('baz baz baz ipsum ' || repeat(' foo ',4999) || '
> labor',
> $$'ipsum' & 'labor'$$::tsquery, 'StartSel=>, StopSel=<,
> MaxFragments=100, MaxWords=7, MinWords=3') ;
> ts_headline
> -------------
> baz baz baz
> (1 row)

When ts_headline searches the document, it splits the document into
segments whose length is internally called max_cover, which is not
configurable for now [1]. In the latter case above, it is
MaxFragments * (max(MaxWords * 10, 100)) = 10000 "words", where
whitespace is counted as words. The document has 10007 "words",
where 'ipsum' is the 7th word and 'labor' is the 10007th word. The two
words aren't within a 10000-word segment, so the match is missed.
ts_headline instead returns the first MinWords words, as you see.
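
To make the arithmetic above concrete, here is a rough sketch in Python. It
is not the PostgreSQL implementation: the max_cover formula is taken as
quoted in this message, and the tokenizer is an assumed stand-in that treats
each run of whitespace and each run of non-whitespace as one "word". The
names max_cover() and token_positions() are hypothetical helpers for this
illustration only.

```python
import re


def max_cover(max_fragments: int, max_words: int) -> int:
    """Segment length, using the formula quoted in this message."""
    return max_fragments * max(max_words * 10, 100)


def token_positions(n_foo: int) -> tuple[int, int, int]:
    """Return (position of 'ipsum', position of 'labor', total tokens).

    Positions are 1-based. Assumption: every run of whitespace and every
    run of non-whitespace counts as one token, which only approximates
    the default text search parser.
    """
    doc = "baz baz baz ipsum " + " foo " * n_foo + "\nlabor"
    tokens = re.findall(r"\s+|\S+", doc)
    ipsum = tokens.index("ipsum") + 1
    labor = tokens.index("labor") + 1
    return ipsum, labor, len(tokens)


if __name__ == "__main__":
    print("max_cover:", max_cover(100, 7))
    for n in (4998, 4999):
        ipsum, labor, total = token_positions(n)
        print(n, "foo ->", total, "tokens; ipsum at", ipsum, "labor at", labor)
```

Under these assumptions the 4,999-word case puts 'ipsum' and 'labor' 10000
tokens apart, right at the max_cover value computed from the options in the
report, while the 4,998-word case is two tokens shorter.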

This is not a bug, but designed behavior. However, we might want to
document that behavior.

This could be "improved" as noted in [1], but in this specific case, I
doubt the usefulness of ts_headline picking it up when the two words are
that far apart from each other, in exchange for possible degradation.

[1] For developers, wparser_def.c:2582
> * We might eventually make max_cover a user-settable parameter, but for
> * now, just compute a reasonable value based on max_words and
> * max_fragments.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
