Re: Avoid full GIN index scan when possible

From: Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Marc Cousin <cousinmarc(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Subject: Re: Avoid full GIN index scan when possible
Date: 2019-11-22 23:35:50
Message-ID: f2889144-db1d-e3b2-db97-cfc8794cda43@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Attached 8th version of the patches.

A brief description of the patches and their improvements/overheads:

1. Avoid full scan in "empty-ALL AND regular" case.
One EVERYTHING entry with unconditional recheck is used instead of multiple
ALL entries.
Overhead for rechecking NULLs and "empty-ALL" keys is introduced.
Overhead of merging ALL-lists for multicolumn indexes is eliminated.

2. Fix overhead of rechecking NULLs.
Returned back overhead of merging NOT_NULL-lists for multicolumn indexes.

3. Fix overhead of unnecessary recheck of "empty-ALL" keys using consistentFn().
Performance of "empty-ALL [AND empty-ALL ...]" case now is the same as on
master.

4. Avoid full scan in "non-empty-ALL AND regular" case.
New variant of list-merging logic added to scanGetItem().

On 07.08.2019 23:32, Tom Lane wrote:
> Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru> writes:
>> Attached 6th version of the patches.
> I spent a bit of time looking at these. Attached is a proposed revision
> of the 0001 patch, with some minor changes:
>
> * I didn't adopt your move of the "Non-default modes require the index
> to have placeholders" test to after the stanza handling zero-key cases.
> I think that move would be correct for 0001 as it stands, but it's far
> less clear that it's still safe after 0002/0003 or whatever variant of
> those we end up with. We should leave that code where it is for now,
> enforcing the v1-index requirement for all non-default search modes, and
> reconsider after the dust settles. (Or if we never do reconsider, it
> won't be a big deal --- I doubt many v0 indexes are still out there.)

Ok.

> * Rearranged the selfuncs.c logic to match ginNewScanKey better.
>
> * Cleaned up my own sloppiness in the new gin.sql test cases.
>
> I think this would be committable as it stands, except that replacing
> an ALL scan with an EVERYTHING scan could be a performance regression
> if the index contains many null items. We need to do something about
> that before committing.

Yes, such performance regression does exist (see test results at the end).
And it exists not only if there are many NULLs -- recheck overhead is
significant even in the simple cases like "array @> '{}'". This really
makes patch 0001 non-committable.

And the only thing I can here offer to fix this is the patches 0002 and 0003.

> Unfortunately I'm not sold on either 0002 or 0003 as they stand;
> they seem overly complicated, I'm not convinced they're correct,
> and you haven't really provided examples showing that all this
> extra complexity is worthwhile.

Yes, they are complicated, but I tried to simplify 0002 a bit, and even
divided it into two separate patches 0002 and 0003. For the performance
improvements, see the test results at the end.

> In particular:
>
> * I don't really like the whole business about detecting a constant-true
> ALL condition by applying the consistentFn at this stage. That just
> feels wrong to me: the consistentFn should be expecting some data about
> the index contents and we don't have any to give. If it works, it's
> accidental, and probably it's fragile.

If we have no entries, then there is nothing to pass to consistentFn() and it
should always return the same value for a given scankey. The similar
technique is used in startScanKey() where the fake entryRes[] is passed to it.

> Moreover, the only gain we'd get from it is maybe not having to set
> forcedRecheck, and that doesn't look to me like it would make all that
> much difference.

The forced recheck has a non-zero cost, so this makes real sense.

> * The code seems to be assuming that a zero-key ALL query is necessarily
> precisely equivalent to a NOT NULL condition. This seems flat out wrong.
> At the very least it's possible for such a query to be constant-false,
> rather than constant-true-for-non-null-items. Admittedly, that would
> suggest rather stupid coding of the opclass query-extract function, as
> it could have reported a constant-false condition in an optimizable way
> rather than an unoptimizable one. But we aren't entitled to assume that
> the opclass isn't being dumb; the API says it can do this, so it can.
> I think we have to check the scankey regardless, either in the index or
> via forcedRecheck.

Yes, empty ALL queries are equivalent to NOT NULL with or without recheck.
Patches 0001 and 0002 use unconditional forcedRecheck. Patch 0003 uses
conditional recheck depending on the result of triConsistentFn() invocation.
I added missing check for GIN_FALSE to eliminate constant-false empty
ALL queries. So, the empty ALL scankey is always checked in the index,
but only once at the initialization stage.

> * I really dislike the refcount business in 0003. It's not clear why we
> need that or whether it's correct, and I think it would be unmaintainable
> even if it were documented (which it isn't).
>
>
> ISTM we could get where we need to go in a much simpler way. A couple
> of alternative ideas:
>
> * During ginNewScanKey, separate out ALL-mode queries and don't add them
> to the scankey list immediately. After examining all the keys, if we
> found any normal (DEFAULT or INCLUDE_EMPTY) queries, then go ahead and
> add in the ALL-mode queries so that we can check them in the index, but
> don't cause a full scan. Otherwise, discard all the ALL-mode queries
> and emit a NOT_NULL scan key, setting forcedRecheck so that we apply the
> filtering that way.
>
> * Or we could just discard ALL-mode queries on sight, setting
> forcedRecheck always, and then emit NOT_NULL if we had any
> of those and no normal queries.

I completely rewrote this logic in patch 0004, the reference counting is no
longer needed.

Non-empty ALL keys are immediately added to the list, but the initialization
of hidden ALL entries in them is postponed, and these full scan entries added
only if there are no normal keys. But if we have normal keys, then for each
ALL key enabled special list-merging logic in scanGetItem(), so the items
matching normal keys are emitted to the result even if they have no entries
required for ALL scankeys.

For example, the following intarray query

arr @@ '1' AND arr @@ '!2'

produces two 1-entry scankeys:

DEFAULT('1')
ALL('2') (previously there were 2 entries: '2' and ALL)

When the item lists for the entries '1' and '2' are merged, emitted all items
- having '1' and not having '2', without forced recheck (new logic)
- having both '1' and '2', if triConsistentFn(ALL('2')) returns not FALSE
(ordinary logic, each item must have at least one entry of each scankey)

This helps to do as much work as possible in the index, and to avoid a
unnecessary recheck.

I'm not sure that code changes in scanGetItem() are correct (especially due to
the presence of lossy items), and the whole patch 0004 was not carefully tested,
but if the performance results are interesting, I could work further on this
optimization.

> The thing that seems hard to predict here is whether it is worth tracking
> the presence of additional keys in the index to avoid a recheck in the
> heap. It's fairly easy to imagine that for common keys, avoiding the
> recheck would be a net loss because it requires more work in the index.
> If we had statistics about key counts, which of course we don't, you
> could imagine making this decision dynamically depending on whether an
> ALL query is asking about common or uncommon keys.
>
> BTW --- any way you slice this, it seems like we'd end up with a situation
> where we never execute an ALL query against the index in the way we do
> now, meaning that the relevant code in the scanning logic is dead and
> could be removed. Given that, we don't really need a new NOT_NULL search
> mode; we could just redefine what ALL mode actually does at execution.
> This would be good IMO, because it's not obvious what the difference
> between ALL and NOT_NULL modes is anyway.

The ALL mode is still used now for non-empty ALL queries without normal queries.

Simple performance test:

create table t (a int[], b int[], c int[]);

-- 1M NULLs
insert into t select NULL, NULL, NULL
from generate_series(0, 999999) i;

-- 1M 1-element arrays
insert into t select array[i], array[i], array[i]
from generate_series(0, 999999) i;

-- 10k 2-element arrays with common element
insert into t select array[-1,i], array[-1,i], array[-1,i]
from generate_series(0, 9999) i;

create extension intarray;
create index on t using gin (a gin__int_ops, b gin__int_ops, c gin__int_ops);

| Query time, ms
WHERE condition | master | patches
| | #1 | #2 | #3 | #4
---------------------------------------+--------+------+------+------+------
a @> '{}' | 272 | 473 | 369 | 271 | 261
a @> '{}' and b @> '{}' | 374 | 548 | 523 | 368 | 353
a @> '{}' and b @> '{}' and c @> '{}' | 479 | 602 | 665 | 461 | 446

a @> '{}' and a @@ '1' | 52.2 | 0.4 | 0.4 | 0.4 | 0.4
a @> '{}' and a @@ '-1' | 56.2 | 4.0 | 4.0 | 2.3 | 2.3

a @@ '!-1' and a @@ '1' | 52.8 | 53.0 | 52.7 | 52.9 | 0.3
a @@ '!1' and a @@ '-1' | 54.9 | 55.2 | 55.1 | 55.3 | 2.4

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
0001-Avoid-GIN-full-scan-for-empty-ALL-keys-v08.patch text/x-patch 11.9 KB
0002-Avoid-GIN-recheck-for-NULLs-using-new-search-mode-v08.patch text/x-patch 13.2 KB
0003-Force-GIN-recheck-for-empty-ALL-keys-more-accurately-v08.patch text/x-patch 5.2 KB
0004-Avoid-GIN-full-scan-for-non-empty-ALL-keys-v08.patch text/x-patch 17.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Dilger 2019-11-23 00:01:15 Re: Assertion failing in master, predicate.c
Previous Message Ranier Vilela 2019-11-22 23:34:45 RE: [PATCH] Tiny optmization or a bug?