Re: strange plan with bitmap heap scan and multiple partial indexes

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange plan with bitmap heap scan and multiple partial indexes
Date: 2015-07-11 21:20:23
Message-ID: 55A18897.5090008@2ndquadrant.com
Lists: pgsql-hackers

On 07/11/2015 06:32 PM, Tom Lane wrote:
...
>
> Presumably, this is happening because the numbers of rows actually
> satisfying the index predicates are so small that it's a matter of
> luck whether any of them are included in ANALYZE's sample.
>
> Given this bad data for the index sizes, it's not totally surprising
> that choose_bitmap_and() does something wacko. I'm not sure whether
> we should try to make it smarter, or write this off as "garbage in,
> garbage out".

I think we should make it smarter, if possible - while this example is
somewhat artificial, partial indexes are often used exactly like this,
i.e. to index only a very small subset of the data. A good example may be
an index on "active invoices", i.e. invoices that have not yet been sorted
out. There may be a lot of invoices in the table, but only a very small
fraction of them will be active (and thus in the index).

So I don't think this is an artificial problem, and we should not write it
off as "garbage in".

> Another idea is to not trust any individual ANALYZE's estimate of
> the index rowcount so completely. (I'd thought that the
> moving-average logic would get applied to that, but it doesn't seem
> to be kicking in for some reason.)
>
> We could probably make this smarter if we were willing to apply the
> predicate-proof machinery in more situations; in this example, once
> we know that idx001 is applicable, we really should disregard idx002
> and idx003 altogether because their predicates are implied by
> idx001's. I've always been hesitant to do that because the cost of
> checking seemed likely to greatly outweigh the benefits. But since
> Tomas is nosing around in this territory already, maybe he'd like to
> investigate that further.

I think there are two possible approaches in general - we may improve
the statistics somehow, or we may start doing the predicate proofing.

I doubt approaching this at the statistics level alone is sufficient,
because even with the statistics target set to 10k (i.e. the most detailed
one), the sample is still of a fixed size. So there will always be a
combination of a sufficiently large data set and a sufficiently selective
partial index that causes trouble with the sampling.
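
Just to illustrate the arithmetic (a standalone sketch, not PostgreSQL
code; it assumes the usual ANALYZE sample size of 300 * statistics_target
rows, and the table and index sizes are made up):

#include <stdio.h>

/*
 * Back-of-the-envelope sketch (not PostgreSQL code): how many rows
 * belonging to a very selective partial index can we expect to see in
 * an ANALYZE sample? The table and index sizes are hypothetical.
 */
int
main(void)
{
    double  table_rows = 1e9;       /* hypothetical large table */
    double  index_rows = 100;       /* rows satisfying the partial predicate */
    double  stats_target = 10000;   /* maximum statistics target */
    double  sample_rows = 300 * stats_target;   /* usual ANALYZE sample size */

    /* expected number of sampled rows that fall into the partial index */
    double  expected = sample_rows * (index_rows / table_rows);

    printf("sample size: %.0f rows\n", sample_rows);
    printf("expected index rows in the sample: %.2f\n", expected);

    /*
     * With these numbers the expectation is ~0.3, so whether the sample
     * contains any qualifying row at all is mostly a matter of luck.
     */
    return 0;
}

And with the default statistics target (100) the sample is just 30,000
rows, which makes it even worse.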

Moreover, I can't really think of a way to fix this at the statistics
level. Maybe there's a clever trick guarding against this particular
issue, but my personal experience is that whenever I used such a smart
hack, it eventually caused strange issues elsewhere.

So I think predicate proofing is the better approach, although of course
the planning cost may be an issue. But maybe we can make it cheaper with
some clever tricks? For example, given two predicates A and B, if A => B
then selectivity(A) <= selectivity(B). So whenever selectivity(A) >
selectivity(B) the implication is impossible, and we can skip the
expensive proof entirely. We should have the selectivities at hand
anyway, no?
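
A rough sketch of what I mean (hypothetical code - the struct and function
names are made up for illustration and do not reflect the actual
choose_bitmap_and() or predicate-proof interfaces):

#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical sketch, not PostgreSQL code: use the cheap selectivity
 * comparison as a necessary condition before running the expensive
 * predicate proof when discarding redundant partial indexes.
 */
typedef struct IndexCandidate
{
    const char *name;
    double      selectivity;    /* estimated fraction of rows in the index */
    bool        discarded;
} IndexCandidate;

/* stand-in for the expensive predicate-proof machinery */
static bool
pred_implies(const IndexCandidate *a, const IndexCandidate *b)
{
    (void) a;
    (void) b;
    return true;                /* pretend the proof always succeeds */
}

static void
prune_implied_indexes(IndexCandidate *cand, int ncand)
{
    for (int i = 0; i < ncand; i++)
    {
        if (cand[i].discarded)
            continue;

        for (int j = 0; j < ncand; j++)
        {
            if (i == j || cand[j].discarded)
                continue;

            /*
             * cand[i] => cand[j] is only possible when
             * selectivity(cand[i]) <= selectivity(cand[j]), so a simple
             * comparison lets us skip the expensive proof.
             */
            if (cand[i].selectivity > cand[j].selectivity)
                continue;

            /* expensive check: does i's predicate imply j's? */
            if (pred_implies(&cand[i], &cand[j]))
                cand[j].discarded = true;   /* j is redundant, i is tighter */
        }
    }
}

int
main(void)
{
    IndexCandidate  cand[3] = {
        {"idx001", 0.0001, false},
        {"idx002", 0.0100, false},
        {"idx003", 0.0500, false},
    };

    prune_implied_indexes(cand, 3);

    for (int i = 0; i < 3; i++)
        printf("%s: %s\n", cand[i].name,
               cand[i].discarded ? "discarded" : "kept");
    return 0;
}

Of course the selectivity comparison can only rule proofs out, not replace
them, so in the worst case (all selectivities roughly equal) it saves
nothing - but it costs almost nothing either.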

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
