Re: Use of additional index columns in rows filtering

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Maxim Ivanov <hi(at)yamlcoder(dot)me>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>
Subject: Re: Use of additional index columns in rows filtering
Date: 2023-07-21 19:17:50
Message-ID: CAH2-WzkaP=gQp9uaY4Xrv74WetTttcU+C-tWbtrQLm+=CaUDdg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 21, 2023 at 4:52 AM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> > (Actually, I'm glossing over a lot. The MDAM paper describes
> > techniques that'd make even the really challenging cases safe, through
> > a process of converting quals from conjunctive normal form into
> > disjunctive normal form. This is more or less the form that the state
> > machine implemented by _bt_advance_array_keys() produces already,
> > today. But even with all this practically all of the heavy lifting
> > takes place before the index scan even begins, during preprocessing --
> > so you still require surprisingly few changes to index scans
> > themselves.)
> >
>
> Ah, OK. I was assuming the execution might be more complex.

Sort of. Execution of individual "primitive index scans" effectively
works the same way as it does already -- the preprocessing is required
to generate disjunctive primitive index scans that look like one big
index scan when combined (which is how native execution of SAOPs by
nbtree works today).
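
By SAOP I mean a ScalarArrayOpExpr qual -- the kind you get from an
IN list or "= ANY". Just as an illustration (using tenk1 column
names):

-- Equivalent SAOP quals; an IN list is converted to "= ANY" during
-- parsing. nbtree executes the array natively, as a series of
-- primitive index scans that together look like one ordered scan.
SELECT count(*) FROM tenk1 WHERE two = ANY (ARRAY[0, 1]);
SELECT count(*) FROM tenk1 WHERE two IN (0, 1);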

The challenge during execution of index scans (execution proper, not
preprocessing) comes from processing a "flattened" DNF representation
of your original quals efficiently. If you have (say) 3 SAOPs, then
the total number of distinct DNF quals is the cartesian product of the
3 arrays -- so it grows multiplicatively. But you can skip over the
single-value DNF quals quickly when they have no matches, which isn't
all that hard.
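
To make the arithmetic concrete, here's a hypothetical sketch (the
table and column names are made up, but the array sizes match the
tenk1_dyn_saop example further down):

-- 3 SAOPs with 2, 4, and 15 array elements. Preprocessing expands
-- them into 2 * 4 * 15 = 120 distinct single-value DNF quals, one
-- per element of the cartesian product:
SELECT * FROM tab
WHERE a IN (0, 1)
  AND b IN (1, 2, 3, 4)
  AND c IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);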

We get some very useful invariants with these DNF quals: any
individual index tuple can match at most one DNF qual. Plus the quals
are materialized in key space order, which is
ideally suited to processing by an ordered scan. So just as you can
use the array keys to skip over parts of the index when searching for
an index tuple, you can use an index tuple to skip over the arrays
when searching for the next relevant set of array keys. It works both
ways!

> But I was
> thinking more about the costing part - if you convert the clauses in
> some way, does that affect the reliability of estimates somehow?

Obviously, it doesn't affect the selectivity at all. That seems most
important (you kinda said the same thing yourself).

> If the
> conversion from AND to OR makes the list of clauses more complex, that
> might be an issue ...

That's definitely a concern. Even still, the biggest problem by far in
this general area is selectivity estimation. Which, in a way, can be
made a lot easier by this general approach.

Let's say we have a copy of the tenk1 table (tenk1_dyn_saop, as it
appears in the query below), with the same composite index as in my
example upthread (on "(two,four,twenty)"). Further suppose you have a
very simple query: "select count(*) from tenk1_dyn_saop". On master
(and with
the patch) that's going to give you an index-only scan on the
composite index (assuming it's the only index), which is quite
efficient despite being a full index scan -- 11 buffer hits.
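
If you want to reproduce the buffer hit numbers, EXPLAIN's BUFFERS
option is the easiest way. Here's a sketch of the setup I'm assuming
(the standard regression test tenk1 data, copied into a table whose
only index is the composite one):

CREATE TABLE tenk1_dyn_saop AS SELECT * FROM tenk1;
CREATE INDEX ON tenk1_dyn_saop (two, four, twenty);
-- Set the visibility map, so the scan can stay index-only:
VACUUM ANALYZE tenk1_dyn_saop;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM tenk1_dyn_saop;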

This much more complicated count(*) query is where it gets interesting:

select
  count(*),
  two,
  four,
  twenty
from
  tenk1_dyn_saop
where
  two in (0, 1)
  and four in (1, 2, 3, 4)
  and twenty in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
group by
  two,
  four,
  twenty
order by
  two,
  four,
  twenty;

It's inherently very difficult to predict how selective this query
will be using basic statistics. But maybe it doesn't need to matter so
much, so often.

The patch can execute this with an index-only scan + GroupAggregate.
What ends up happening is that the patch gets 9 buffer hits -- so
pretty close to 11. Master can use almost the same query plan (it
uses the quals, but needs a hashagg + sort). It ends up getting 245
buffer hits -- vastly more than what we see for the full index scan
case (and that has nothing to do with the sort, nor with any issue
involving a LIMIT). That's nearly as many hits as you'd get with a
sequential scan. (BTW, I don't need to coax the query planner to get
this result on master.)

With the patch you can vary the predicate in whatever way, so that the
selectivity shifts up or down. Occasionally you'll get maybe one extra
buffer access relative to the base full index scan case, but overall
the patch makes the worst case look very much like a full index scan
(plus some relatively tiny CPU overhead). This is just common sense,
in a way; selectivities are always between 0.0 and 1.0. Why shouldn't
we be able to think about it like that?
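
For instance (illustrative variants on the same table):

-- Much narrower, and differently shaped, versions of the earlier
-- query; each should come out looking like (at worst) the full
-- index-only scan, give or take the odd extra buffer hit.
SELECT count(*) FROM tenk1_dyn_saop
WHERE two IN (0) AND four IN (1) AND twenty IN (1, 2);

SELECT count(*) FROM tenk1_dyn_saop
WHERE two IN (0, 1) AND four IN (2, 4)
  AND twenty IN (1, 3, 5, 7, 9, 11, 13, 15, 17, 19);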

> I wasn't really thinking about LIMIT, and I don't think it changes the
> overall behavior very much (sure, it's damn difficult to estimate for
> skewed data sets, but meh).
>
> The case I had in mind is something like this:
>
> CREATE TABLE t (a int, b int, c int);
> CREATE INDEX ON t (a);
> INSERT INTO t SELECT i, i, i FROM generate_series(1,1000000) s(i);
>
> SELECT * FROM t WHERE bad_qual(a) AND b < 1 AND c < 1 ORDER BY a;
>
> where bad_qual is expensive and matches almost all rows.

You must distinguish between quals that can become required scan keys
(or can terminate the scan), and all other quals. This is really
important for my pending SAOP patch, but I think it might be important
here too. I wonder if the best place to address the possibility of
such a regression is in the index AM itself.

Let's make your example a bit more concrete: let's assume that
bad_qual is a very expensive integer comparison, against a column that
has only one possible value. So now your example becomes:

CREATE TABLE t (a expensive_int, b int, c int);
CREATE INDEX ON t (a);
INSERT INTO t SELECT 42, i, i FROM generate_series(1,1000000) s(i);
SELECT * FROM t WHERE a IN (7, 42) AND b < 1 AND c < 1 ORDER BY a;

(I'm using a SAOP here because the planner will more or less disregard
the ORDER BY if I make it "= 42" instead. Maybe that makes it
simpler.)
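
A runnable approximation (with plain int standing in for the
hypothetical expensive_int type):

CREATE TABLE t (a int, b int, c int);
CREATE INDEX ON t (a);
INSERT INTO t SELECT 42, i, i FROM generate_series(1, 1000000) s(i);
-- The SAOP keeps the ORDER BY relevant, so the planner should pick
-- an ordered scan of the index on "a", with no explicit sort:
EXPLAIN SELECT * FROM t WHERE a IN (7, 42) AND b < 1 AND c < 1
ORDER BY a;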

Sure, you're getting a full index scan here, and you get all these
useless comparisons on "a" -- that's a real risk. But AFAICT there is
no real need for it. There is another nbtree patch that might help --
one that teaches nbtree's _bt_readpage function to skip useless
comparisons like this:

https://postgr.es/m/079c3f8e-3371-abe2-e93c-fc8a0ae3f571@garret.ru

In order for this kind of optimization to be possible at all, we must
be able to reason about "a" as a column whose values will always be in
key space order. That is, nbtree must recognize that "a" is the most
significant key column, not (say) a low-order column from a composite
index -- it's a required column in both directions. If _bt_readpage
can determine that the first tuple on a leaf page has the value "42",
and the high key has that same value, then we can skip all of the
comparisons of "a" for that page, right away (in fact we don't need
to perform a single "a" comparison against the page's tuples). Now it
doesn't matter that they're super expensive
comparisons (or it hardly matters).

It's natural to think of things like this _bt_readpage optimization as
something that makes existing types of plan shapes run faster. But you
can also think of them as things that make new and fundamentally
better plan shapes feasible, by making risky things much less risky.

> FWIW the "ORDER BY" is necessary, because otherwise we may not even
> build the index path (no index keys, no interesting pathkeys). It's just
> an opportunistic optimization - if already doing index scan, try doing
> this too. I wonder if we should relax that ...

I'm kinda doing the same thing with ordering in my own patch. In
general, even if the query really doesn't care about the index order,
there may be a lot of value in making the nbtree code understand that
this is an ordered index scan. That's what enables skipping, in all
its forms (skipping individual comparisons, skipping whole subsections
of the index, etc).

I'm not saying that this is 100% problem free. But it seems like a
promising high level direction.

> In a way, focusing on the worst case does that by assuming the worst
> combination - which is fine, although it may choose the slower (but
> safer) approach in some cases.

I don't think that it has to be slower on average (even by a tiny
bit). It might just end up being slightly faster on average, and way
faster on occasion.

--
Peter Geoghegan
