Re: Avoid full GIN index scan when possible

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>
Cc: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Marc Cousin <cousinmarc(at)gmail(dot)com>
Subject: Re: Avoid full GIN index scan when possible
Date: 2020-01-06 15:21:55
Message-ID: 20200106152155.uuungi2p776p5lbk@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Dec 27, 2019 at 04:36:14AM +0300, Nikita Glukhov wrote:
>On 26.12.2019 4:59, Alexander Korotkov wrote:
>>
>> I've tried to add patch #4 to comparison, but I've catch assertion
>>failure.
>>
>>TRAP: FailedAssertion("key->includeNonMatching", File: "ginget.c",
>>Line: 1340)
>There simply should be inverted condition in the assertion:
>Assert(!key->includeNonMatching);
>
>I have looked at v9 patch, and here is my review:
>
>1. I agree with NULL-flag handling simplifications in ginNewScanKey(),
>ginScanKeyAddHiddenEntry() extraction.
>
>2. I also agree that usage of nrequired/nadditional in keyGetItem() is a more
>natural solution to implement exclusion keys than my previous attempt of doing
>that in scanGetKey().
>
>But there are some questions:
>
>Can we avoid referencing excludeOnly flag keyGetItem() by replacing these
>references with !nrequired?
>
>Maybe it would be better to move the whole block of keyGetItem() code
>starting from the first loop over required keys and ending before the loop over
>additional keys inside 'if (key->nrequired) { ... }'?
>
>Can we avoid introducing excludeOnly flag by reusing searchMode and/or by
>moving the initialization of nrequired/nadditional into ginNewScanKey()?
>
>
>3. The following two times repeated NULL-filtering check looks too complicated
>and needs to be refactored somehow:
>
>- res = key->triConsistentFn(key);
>+ if (key->excludeOnly &&
>+ key->nuserentries < key->nentries &&
>+ key->scanEntry[key->nuserentries]->queryCategory == GIN_CAT_NULL_KEY &&
>+ key->entryRes[key->nuserentries] == GIN_TRUE)
>+ res = GIN_FALSE;
>+ else
>+ res = key->triConsistentFn(key);
>
>For example, a special consistentFn() can be introduced for such NOT_NULL
>scankeys. Or even a hidden separate one-entry scankey with a trivial
>consistentFn() can be added instead of adding hidden entry.
>
>
>4. forcedRecheck flag that was previously used for discarded empty ALL scankeys
>is removed now. 0-entry exclusion keys can appear instead, and their
>consistentFn() simply returns constant value. Could this lead to tangible
>overhead in some cases (in comparison to forcedRecheck flag)?
>
>
>5. A hidden GIN_CAT_EMPTY_QUERY is added only for the first empty ALL-scankey,
>NULLs in other columns are filtered out with GIN_CAT_NULL_KEY. This looks like
>asymmetric, and it leads to accelerations is some cases and slowdowns in others
>(depending on NULL fractions and their correlations in columns).
>
>The following test shows a significant performance regression of v9:
>
>insert into t select array[i], NULL, NULL from generate_series(1, 1000000) i;
>
> | Query time, ms
> WHERE condition | master | v8 | v9
>---------------------------------------+--------+--------+---------
> a @> '{}' | 224 | 213 | 212
> a @> '{}' and b @> '{}' | 52 | 57 | 255
> a @> '{}' and b @> '{}' and c @> '{}' | 51 | 58 | 290
>
>
>In the older version of the patch I tried to do the similar things (initialize
>only one NOT_NULL entry for the first column), but refused to do this in v8.
>
>So, to avoid slowdowns relative to master, I can offer simply to add
>GIN_CAT_EMPTY_QUERY entry for each column with empty ALL-keys if there are
>no normal keys.
>

Yeah, I can confirm those results, although on my system the timings are
a bit different (I haven't tested v8):

| Query time, ms
WHERE condition | master | v9
---------------------------------------+--------+---------
a @> '{}' | 610 | 589
a @> '{}' and b @> '{}' | 185 | 665
a @> '{}' and b @> '{}' and c @> '{}' | 185 | 741

So that's something we probably need to address, perhaps by using the
GIN_CAT_EMPTY_QUERY entries as proposed.

I've also tested this on a database storing mailing lists archives with
a trigram index, and in that case the performance with short values gets
much better. The "messages" table has two text fields with a GIN trigram
index - subject and body, and querying them with short/long values works
like this:

WHERE | master | v9
--------------------------------------------------------------
subject LIKE '%aa%' AND body LIKE '%xx%' | 4943 | 4052
subject LIKE '%aaa%' AND body LIKE '%xx%' | 10 | 10
subject LIKE '%aa%' AND body LIKE '%xxx%' | 380 | 13
subject LIKE '%aaa%' AND BODY LIKE '%xxx%' | 2 | 2

which seems fairly nice. I've done tests with individual columns, and
that seems to be working fine too.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Verite 2020-01-06 16:00:11 Re: Unicode normalization SQL functions
Previous Message Tom Lane 2020-01-06 15:15:19 Re: pgsql: Add basic TAP tests for psql's tab-completion logic.