Re: Avoid full GIN index scan when possible

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>
Cc: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Marc Cousin <cousinmarc(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Subject: Re: Avoid full GIN index scan when possible
Date: 2019-08-07 20:32:20
Message-ID: 19438.1565209940@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru> writes:
> Attached 6th version of the patches.

I spent a bit of time looking at these. Attached is a proposed revision
of the 0001 patch, with some minor changes:

* I didn't adopt your move of the "Non-default modes require the index
to have placeholders" test to after the stanza handling zero-key cases.
I think that move would be correct for 0001 as it stands, but it's far
less clear that it's still safe after 0002/0003 or whatever variant of
those we end up with. We should leave that code where it is for now,
enforcing the v1-index requirement for all non-default search modes, and
reconsider after the dust settles. (Or if we never do reconsider, it
won't be a big deal --- I doubt many v0 indexes are still out there.)

* Rearranged the selfuncs.c logic to match ginNewScanKey better.

* Cleaned up my own sloppiness in the new gin.sql test cases.

I think this would be committable as it stands, except that replacing
an ALL scan with an EVERYTHING scan could be a performance regression
if the index contains many null items. We need to do something about
that before committing.

Unfortunately I'm not sold on either 0002 or 0003 as they stand;
they seem overly complicated, I'm not convinced they're correct,
and you haven't really provided examples showing that all this
extra complexity is worthwhile.

In particular:

* I don't really like the whole business about detecting a constant-true
ALL condition by applying the consistentFn at this stage. That just
feels wrong to me: the consistentFn should be expecting some data about
the index contents and we don't have any to give. If it works, it's
accidental, and probably it's fragile. Moreover, the only gain we'd get
from it is maybe not having to set forcedRecheck, and that doesn't look
to me like it would make all that much difference.

* The code seems to be assuming that a zero-key ALL query is necessarily
precisely equivalent to a NOT NULL condition. This seems flat out wrong.
At the very least it's possible for such a query to be constant-false,
rather than constant-true-for-non-null-items. Admittedly, that would
suggest rather stupid coding of the opclass query-extract function, as
it could have reported a constant-false condition in an optimizable way
rather than an unoptimizable one. But we aren't entitled to assume that
the opclass isn't being dumb; the API says it can do this, so it can.
I think we have to check the scankey regardless, either in the index or
via forcedRecheck.

* I really dislike the refcount business in 0003. It's not clear why we
need that or whether it's correct, and I think it would be unmaintainable
even if it were documented (which it isn't).

ISTM we could get where we need to go in a much simpler way. A couple
of alternative ideas:

* During ginNewScanKey, separate out ALL-mode queries and don't add them
to the scankey list immediately. After examining all the keys, if we
found any normal (DEFAULT or INCLUDE_EMPTY) queries, then go ahead and
add in the ALL-mode queries so that we can check them in the index, but
don't cause a full scan. Otherwise, discard all the ALL-mode queries
and emit a NOT_NULL scan key, setting forcedRecheck so that we apply the
filtering that way.

* Or we could just discard ALL-mode queries on sight, setting
forcedRecheck always, and then emit NOT_NULL if we had any
of those and no normal queries.

The thing that seems hard to predict here is whether it is worth tracking
the presence of additional keys in the index to avoid a recheck in the
heap. It's fairly easy to imagine that for common keys, avoiding the
recheck would be a net loss because it requires more work in the index.
If we had statistics about key counts, which of course we don't, you
could imagine making this decision dynamically depending on whether an
ALL query is asking about common or uncommon keys.

BTW --- any way you slice this, it seems like we'd end up with a situation
where we never execute an ALL query against the index in the way we do
now, meaning that the relevant code in the scanning logic is dead and
could be removed. Given that, we don't really need a new NOT_NULL search
mode; we could just redefine what ALL mode actually does at execution.
This would be good IMO, because it's not obvious what the difference
between ALL and NOT_NULL modes is anyway.

regards, tom lane

Attachment Content-Type Size
Avoid-GIN-full-scan-for-empty-ALL-keys-v07.patch text/x-diff 11.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Melanie Plageman 2019-08-07 20:47:17 Re: Adding a test for speculative insert abort case
Previous Message Alvaro Herrera 2019-08-07 20:27:06 Re: Problem with default partition pruning