Use of additional index columns in rows filtering

From: Maxim Ivanov <hi(at)yamlcoder(dot)me>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Use of additional index columns in rows filtering
Date: 2023-02-15 08:57:31
Message-ID: N1xaIrU29uk5YxLyW55MGk5fz9s6V2FNtj54JRaVlFbPixD5z8sJ07Ite5CvbWwik8ZvDG07oSTN-usENLVMq2UAcizVTEd5b-o16ZGDIIU=@yamlcoder.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi All,

I'd like to report what seems to be a missing optimization opportunity or understand why it is not possible to achieve.

TLDR; additional index column B specified in CREATE INDEX .. (A) INCLUDE(B) is not used to filter rows in queries like WHERE B = $1 ORDER BY A during IndexScan. https://dbfiddle.uk/iehtq44L

Take following database:

CREATE TABLE t(
a integer NOT NULL,
b integer NOT NULL,
d integer NOT NULL
);

CREATE UNIQUE INDEX t_a_include_b ON t (a) INCLUDE (b);
-- I'd expect index above to behave as index below for the purpose
-- of this query
-- CREATE UNIQUE INDEX ON t(a,b);

INSERT INTO t(
SELECT random() * 100000000 as a,
random() * 3 as b,
generate_series as d FROM generate_series(1,200000)
) ON CONFLICT DO NOTHING;

If we filter on `a` and `b` columns while scanning index created as `(a) INCLUDE (b)` it seems to be fetching tuples from heap to check for condition `b = 4` despite both columns available in the index:

SELECT * FROM t WHERE a > 1000000 and b = 4 ORDER BY a ASC LIMIT 10;

Here is the plan (notice high "shared hit"):

Limit (cost=0.42..10955.01 rows=1 width=12) (actual time=84.283..84.284 rows=0 loops=1)
   Output: a, b, d
   Buffers: shared hit=198307
   -> Index Scan using t_a_include_b on public.t (cost=0.42..10955.01 rows=1 width=12) (actual time=84.280..84.281 rows=0 loops=1)
         Output: a, b, d
         Index Cond: (t.a > 1000000)
         Filter: (t.b = 4)
         Rows Removed by Filter: 197805
         Buffers: shared hit=198307
Planning:
   Buffers: shared hit=30
Planning Time: 0.201 ms
Execution Time: 84.303 ms

And here is the plan with index on (a,b).

Limit (cost=0.42..4447.90 rows=1 width=12) (actual time=6.883..6.884 rows=0 loops=1)
   Output: a, b, d
   Buffers: shared hit=613
   -> Index Scan using t_a_b_idx on public.t (cost=0.42..4447.90 rows=1 width=12) (actual time=6.880..6.881 rows=0 loops=1)
         Output: a, b, d
         Index Cond: ((t.a > 1000000) AND (t.b = 4))
         Buffers: shared hit=613
Planning:
   Buffers: shared hit=41
Planning Time: 0.314 ms
Execution Time: 6.910 ms

Because query doesn't sort on `b`, only filters on it while sorting on `a`, I'd expect indexes `(a) INCLUDE (b)` and `(a,b)` behave exactly the same with this particular query.

Interestingly, IndexOnlyScan is capable of using additional columns to filter rows without fetching them from the heap, but only for visibible tuples:

VACUUM FREEZE t;
SELECT a,b FROM t WHERE a > 1000000 and b = 4 ORDER BY a ASC LIMIT 10;

Limit (cost=0.42..6619.76 rows=1 width=8) (actual time=18.479..18.480 rows=0 loops=1)
  Output: a, b
   Buffers: shared hit=662
   -> Index Only Scan using t_a_include_b on public.t (cost=0.42..6619.76 rows=1 width=8) (actual time=18.477..18.477 rows=0 loops=1)
         Output: a, b
        Index Cond: (t.a > 1000000)
         Filter: (t.b = 4)
         Rows Removed by Filter: 197771
         Heap Fetches: 0
         Buffers: shared hit=662

Removing VACUUM makes it behave like IndexScan and fetch candidate tuples from heap all while returning zero rows in the result.

To make query plan comparable I had to force index scan on both with:

SET enable_bitmapscan to off;
SET enable_seqscan to off;
SET max_parallel_workers_per_gather = 0;

Self contained fully reproducible example is in https://dbfiddle.uk/iehtq44L

Regards,
Maxim

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2023-02-15 09:06:49 Re: [PATCH] Add pretty-printed XML output option
Previous Message Alvaro Herrera 2023-02-15 08:32:01 Re: Support logical replication of DDLs