Re: Avoid full GIN index scan when possible

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marc Cousin <cousinmarc(at)gmail(dot)com>
Subject: Re: Avoid full GIN index scan when possible
Date: 2019-06-29 10:25:14
Message-ID: 20190629102514.ucfzglxu7ccjsbjr@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jun 29, 2019 at 11:10:03AM +0200, Julien Rouhaud wrote:
>On Sat, Jun 29, 2019 at 12:51 AM Nikita Glukhov
><n(dot)gluhov(at)postgrespro(dot)ru> wrote:>
>> On 29.06.2019 1:23, Julien Rouhaud wrote:
>>
>> But that kinda resembles stuff we already have - selectivity/cost. So
>> why shouldn't this be considered as part of costing?
>>
>> Yeah, I'm not entirely convinced that we need anything new here.
>> The cost estimate function can detect such situations, and so can
>> the index AM at scan start --- for example, btree checks for
>> contradictory quals at scan start. There's a certain amount of
>> duplicative effort involved there perhaps, but you also have to
>> keep in mind that we don't know the values of run-time-determined
>> comparison values until scan start. So if you want certainty rather
>> than just a cost estimate, you may have to do these sorts of checks
>> at scan start.
>>
>> Ah, I didn't know about _bt_preprocess_keys(). I'm not familiar with
>> this code, so please bear with me. IIUC the idea would be to add
>> additional logic in gingetbitmap() / ginNewScanKey() to drop some
>> quals at runtime. But that would mean that additional logic would
>> also be required in BitmapHeapScan, or that all the returned bitmap
>> should be artificially marked as lossy to enforce a recheck?
>>
>> We have a similar solution for this problem. The idea is to avoid full index
>> scan inside GIN itself when we have some GIN entries, and forcibly recheck
>> all tuples if triconsistent() returns GIN_MAYBE for the keys that emitted no
>> GIN entries.
>
>Thanks for looking at it. That's I think a way better approach.
>
>> The attached patch in its current shape contain at least two ugly places:
>>
>> 1. We still need to initialize empty scan key to call triconsistent(), but
>> then we have to remove it from the list of scan keys. Simple refactoring
>> of ginFillScanKey() can be helpful here.
>>
>> 2. We need to replace GIN_SEARCH_MODE_EVERYTHING with GIN_SEARCH_MODE_ALL
>> if there are no GIN entries and some key requested GIN_SEARCH_MODE_ALL
>> because we need to skip NULLs in GIN_SEARCH_MODE_ALL. Simplest example here
>> is "array @> '{}'": triconsistent() returns GIN_TRUE, recheck is not forced,
>> and GIN_SEARCH_MODE_EVERYTHING returns NULLs that are not rechecked. Maybe
>> it would be better to introduce new GIN_SEARCH_MODE_EVERYTHING_NON_NULL.
>
>Also
>
>+ if (searchMode == GIN_SEARCH_MODE_ALL && nQueryValues <= 0)
>+ {
>+ /*
>+ * Don't emit ALL key with no entries, check only whether
>+ * unconditional recheck is needed.
>+ */
>+ GinScanKey key = &so->keys[--so->nkeys];
>+
>+ hasSearchAllMode = true;
>+ so->forcedRecheck = key->triConsistentFn(key) != GIN_TRUE;
>+ }
>
>Shouldn't you make sure that the forcedRecheck flag can't reset?
>
>> -- patched
>> EXPLAIN ANALYZE SELECT * FROM test WHERE t LIKE '%1234%' AND t LIKE '%1%';
>> QUERY PLAN
>> -----------------------------------------------------------------------------------------------------------------------
>> Bitmap Heap Scan on test (cost=20.43..176.79 rows=42 width=6) (actual time=0.287..0.424 rows=300 loops=1)
>> Recheck Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text))
>> Rows Removed by Index Recheck: 2
>> Heap Blocks: exact=114
>> -> Bitmap Index Scan on test_t_idx (cost=0.00..20.42 rows=42 width=0) (actual time=0.271..0.271 rows=302 loops=1)
>> Index Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text))
>> Planning Time: 0.080 ms
>> Execution Time: 0.450 ms
>> (8 rows)
>
>One thing that's bothering me is that the explain implies that the
>LIKE '%i% was part of the index scan, while in reality it wasn't. One
>of the reason why I tried to modify the qual while generating the path
>was to have the explain be clearer about what is really done.

Yeah, I think that's a bit annoying - it'd be nice to make it clear
which quals were actually used to scan the index. It some cases it may
not be possible (e.g. in cases when the decision is done at runtime, not
while planning the query), but it'd be nice to show it when possible.

A related issue is that during costing is too late to modify cardinality
estimates, so the 'Bitmap Index Scan' will be expected to return fewer
rows than it actually returns (after ignoring the full-scan quals).
Ignoring redundant quals (the way btree does it at execution) does not
have such consequence, of course.

Which may be an issue, because we essentially want to modify the list of
quals to minimize the cost of

bitmap index scan + recheck during bitmap heap scan

OTOH it's not a huge issue, because it won't affect the rest of the plan
(because that uses the bitmap heap scan estimates, and those are not
affected by this).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2019-06-29 10:41:21 Re: Multivariate MCV list vs. statistics target
Previous Message Magnus Hagander 2019-06-29 09:55:30 Re: Commitfest 2019-07, the first of five* for PostgreSQL 13