Re: Missing CFI in hlCover()?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Missing CFI in hlCover()?
Date: 2020-07-29 23:46:08
Message-ID: 1343033.1596066368@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> (I wonder if we need to try to make it faster. I'd supposed that the
> loop was cheap enough to be a non-problem, but with large enough
> documents maybe not? It seems like converting to a hash table could
> be worthwhile for a large doc.)

OK, I dug into Stephen's test case off-list. While some CFIs
(CHECK_FOR_INTERRUPTS calls) would be a good idea, that's just
band-aiding the symptom. The actual
problem is that hlCover() is taking way too much time. The test case
boils down to "common_word & rare_word" matched to a very long document,
wherein the rare_word appears only near the front. Once we're past
that match, hlCover() tries all the remaining matches for common_word,
of which there are plenty ... and for each one, it scans clear to the
end of the document, looking vainly for a substring that will satisfy
the "common_word & rare_word" query. So obviously, this is O(N^2)
in the length of the document :-(.
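To make the quadratic behavior concrete, here is a minimal standalone sketch. This is not the actual hlCover() code: the word-ID document model and the function name are invented for illustration. It models the search as: for each occurrence of the common word, scan forward looking for the rare word, and when the rare word never appears again, every scan runs clear to the end of the document.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical, simplified model of the cover search.  The document is an
 * array of word IDs; a "cover" is a minimal substring containing both query
 * terms.  For each candidate start position holding common_word, we scan
 * forward for rare_word.  If rare_word appears only near the front, every
 * later scan visits the rest of the document in vain, so the total work is
 * O(N^2) in the document length.
 */
static long
count_cover_scan_steps(const int *doc, size_t doclen,
					   int common_word, int rare_word)
{
	long		steps = 0;

	for (size_t start = 0; start < doclen; start++)
	{
		if (doc[start] != common_word)
			continue;
		/* scan forward to the document end looking for rare_word */
		for (size_t i = start; i < doclen; i++)
		{
			steps++;
			if (doc[i] == rare_word)
				break;			/* found a cover [start, i] */
		}
	}
	return steps;
}
```

With a 1000-word document whose only rare_word is at position 0 and whose other 999 positions are all common_word, the scan count is 999 + 998 + ... + 1 = 499500, i.e. roughly N^2/2.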

I have to suppose that I introduced this problem in c9b0c678d, since
AFAIR we weren't getting ts_headline-takes-forever complaints before
that. It's not immediately obvious why the preceding algorithm doesn't
have a similar issue, but then there's not anything at all that was
obvious about the preceding algorithm.

The most obvious way to fix the O(N^2) hazard is to put a limit on the
length of "cover" (matching substring) that we'll consider. For the
mark_hl_words caller, we won't highlight more than max_words words
anyway, so it would be reasonable to bound covers to that length or
some small multiple of it. The other caller mark_hl_fragments is
willing to highlight up to max_fragments of up to max_words each, and
there can be some daylight between the fragments, so it's not quite
clear what the longest reasonable match is. Still, I doubt it's
useful to show a headline consisting of a few words from the start of
the document and a few words from thousands of words later, so a limit
of max_fragments times max_words times something would probably be
reasonable.

We could hard-code a rule like that, or we could introduce a new
explicit parameter for the maximum cover length. The latter would be
more flexible, but we need something back-patchable and I'm concerned
about the compatibility hazards of adding a new parameter in minor
releases. So on the whole I propose hard-wiring a multiplier of,
say, 10 for both these cases.
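The proposal above can be sketched as follows. This is an illustrative model, not the actual patch: the function names and the word-ID document representation are invented, and the multiplier of 10 is the hard-wired value suggested in this mail. Capping the forward scan at a bounded cover length turns the O(N^2) search into O(N * bound):

```c
#include <assert.h>
#include <stddef.h>

/* hard-wired multiplier proposed above; illustrative name */
#define COVER_LEN_MULTIPLIER 10

/*
 * Longest cover worth considering, per the reasoning above: for a
 * mark_hl_words-style caller (max_fragments == 0) we'll show at most
 * max_words words; for a mark_hl_fragments-style caller, up to
 * max_fragments fragments of up to max_words each, with some slop.
 */
static int
max_cover_length(int max_words, int max_fragments)
{
	if (max_fragments > 0)
		return max_fragments * max_words * COVER_LEN_MULTIPLIER;
	return max_words * COVER_LEN_MULTIPLIER;
}

/*
 * Same simplified cover search as before, but each forward scan stops
 * after max_cover_len words, bounding total work at O(N * max_cover_len).
 */
static long
count_bounded_scan_steps(const int *doc, size_t doclen,
						 int common_word, int rare_word,
						 size_t max_cover_len)
{
	long		steps = 0;

	for (size_t start = 0; start < doclen; start++)
	{
		size_t		limit;

		if (doc[start] != common_word)
			continue;
		limit = start + max_cover_len;
		if (limit > doclen)
			limit = doclen;
		for (size_t i = start; i < limit; i++)
		{
			steps++;
			if (doc[i] == rare_word)
				break;			/* found a cover [start, i] */
		}
	}
	return steps;
}
```

On the same pathological 1000-word document, with the default max_words of 35 the bound is 350, and the scan count drops from 499500 to 288575; more importantly, it now grows linearly rather than quadratically with document length.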

regards, tom lane
