Re: [PATCH] fix GIN index search sometimes losing results

From: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PATCH] fix GIN index search sometimes losing results
Date: 2020-05-17 19:53:21
Message-ID: CALT9ZEEW37sR0rP8K7r1Zjxc2e60xXSv7F8kgsJu5tC-AXyK0g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, all!
Below is my variant how to patch Gin-Gist weights issue:
1. First of all I propose to shift from previously Gin's own TS_execute
variant and leave only two: TS_execute with bool result and bool type
callback and ternary TS_execute_recurse with ternary callback. I suppose
all legacy consistent callers can still use bool via provided wrapper.
2. I integrated logic for indexes which do not support weights and
positions inside (which gives MAYBE in certain cases on negation) inside
previous TS_execute_recurse function called with additional flag for this
class of indexes.
3. Check function for GIST and GIN now gives ternary result and is called
with ternary type callback. I think in future nothing prevents smoothly
shifting callback functions, check functions and even TS_execute result to
ternary.

So I also send my variant patch for review and discussion.

Regards,
Pavel Borisov

вс, 17 мая 2020 г. в 03:14, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:

> I wrote:
> > I think the root of the problem is that if we have a query using
> > weights, and we are testing tsvector data that lacks positions/weights,
> > we can never say there's definitely a match. I don't see any decently
> > clean way to fix this without redefining the TSExecuteCallback API
> > to return a tri-state YES/NO/MAYBE result, because really we need to
> > decide that it's MAYBE at the level of processing the QI_VAL node,
> > not later on. I'd tried to avoid that in e81e5741a, but maybe we
> > should just bite that bullet, and not worry about whether there's
> > any third-party code providing its own TSExecuteCallback routine.
> > codesearch.debian.net suggests that there are no external callers
> > of TS_execute, so maybe we can get away with that.
>
> 0001 attached is a proposed patch that does it that way. Given the
> API break involved, it's not quite clear what to do with this.
> ISTM we have three options:
>
> 1. Ignore the API issue and back-patch. Given the apparent lack of
> external callers of TS_execute, maybe we can get away with that;
> but I wonder if we'd get pushback from distros that have automatic
> ABI-break detectors in place.
>
> 2. Assume we can't backpatch, but it's still OK to slip this into
> v13. (This option clearly has a limited shelf life, but I think
> we could get away with it until late beta.)
>
> 3. Assume we'd better hold this till v14.
>
> I find #3 unduly conservative, seeing that this is clearly a bug
> fix, but on the other hand #1 is a bit scary. Aside from the API
> issue, it's not impossible that this has introduced some corner
> case behavioral changes that we'd consider to be new bugs rather
> than bug fixes.
>
> Anyway, some notes for reviewers:
>
> * The core idea of the patch is to make the TS_execute callbacks
> have ternary results and to insist they return TS_MAYBE in any
> case where the correct result is uncertain.
>
> * That fixes the bug at hand, and it also allows getting rid of
> some kluges at higher levels. The GIN code no longer needs its
> own TS_execute_ternary implementation, and the GIST code no longer
> needs to suppose that it can't trust NOT results.
>
> * I put some effort into not leaking memory within tsvector_op.c's
> checkclass_str and checkcondition_str. (The final output array
> can still get leaked, I believe. Fixing that seems like material
> for a different patch, and it might not be worth any trouble.)
>
> * The new test cases in tstypes.sql are to verify that we didn't
> change behavior of the basic tsvector @@ tsquery code. There wasn't
> any coverage of these cases before, and the logic for checkclass_str
> without position info had to be tweaked to preserve this behavior.
>
> * The new cases in tsearch verify that the GIN and GIST code gives
> the same results as the basic operator.
>
> Now, as for the 0002 patch attached: after 0001, the only TS_execute()
> callers that are not specifying TS_EXEC_CALC_NOT are hlCover(),
> which I'd already complained is probably a bug, and the first of
> the two calls in tsrank.c's Cover(). It seems difficult to me to
> argue that it's not a bug for Cover() to process NOT in one call
> but not the other --- moreover, if there was any argument for that
> once upon a time, it probably falls to the ground now that (a) we
> have a less buggy implementation of NOT and (b) the presence of
> phrase queries significantly raises the importance of not taking
> short-cuts. Therefore, 0002 attached rips out the TS_EXEC_CALC_NOT
> flag and has TS_execute compute NOT expressions accurately all the
> time.
>
> As it stands, 0002 changes no regression test results, which I'm
> afraid speaks more to our crummy test coverage than anything else;
> tests that exercise those two functions with NOT-using queries
> would easily show that there is a difference.
>
> Even if we decide to back-patch 0001, I would not suggest
> back-patching 0002, as it's more nearly a definitional change
> than a bug fix. But I think it's a good idea anyway.
>
> I'll stick this in the queue for the July commitfest, in case
> anybody wants to review it.
>
> regards, tom lane
>
>

Attachment Content-Type Size
gin-gist-weight-patch-v3.diff application/octet-stream 21.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joe Wildish 2020-05-17 23:29:56 [PATCH] Add support to psql for edit-and-execute-command
Previous Message Alvaro Herrera 2020-05-17 15:28:51 Re: Add A Glossary