Re: Additional improvements to extended statistics

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Additional improvements to extended statistics
Date: 2020-03-24 13:20:07
Message-ID: CAEZATCX8u9bZzcWEzqA_t7f_OQHu2oxeTUGnFHNEOXnJo35AQg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 24 Mar 2020 at 01:08, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> Hmmm. So let's consider a simple OR clause with two arguments, both
> covered by single statistics object. Something like this:
>
> CREATE TABLE t (a int, b int);
> INSERT INTO t SELECT mod(i, 10), mod(i, 10)
> FROM generate_series(1,100000);
> CREATE STATISTICS s (mcv) ON a,b FROM t;
>
> SELECT * FROM t WHERE a = 0 OR b = 0;
>
> Which is estimated correctly...
>

Hmm, the reason that is estimated correctly is that the MCV values
cover all values in the table, so mcv_totalsel is 1 (or pretty close
to 1), and other_sel is capped to nearly 0, and so the result is
basically just mcv_sel -- i.e., in this case the MCV estimates are
returned more-or-less as-is, rather than being applied as a
correction, and so the result is independent of how many times
extended stats are applied.

The more interesting (and probably more realistic) case is where the
MCV values do not cover the all values in the table, as in the new
mcv_lists_partial examples in the regression tests, for example this
test case, which produces a significant overestimate:

SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0

although actually, I think there's another reason for that (in
addition to the extended stats correction being applied twice) -- the
way the MCV code makes use of base selectivity doesn't seem really
appropriate for OR clauses, because the way base_frequency is computed
is really based on the assumption that every column would be matched.
I'm not sure that there's any easy answer to that though. I feels like
what's needed when computing the match bitmaps in mcv.c, is to produce
a bitmap (would it fit in 1 byte?) per value, to indicate which
columns match, and base_frequency values per column. That would be
significantly more work though, so almost certainly isn't feasible for
PG13.

> IIUC the problem you have in mind is that we end up calling
> statext_mcv_clauselist_selectivity twice, once for the top-level AND
> clause with a single argument, and then recursively for the expanded OR
> clause. Indeed, that seems to be applying the correction twice.
>
>
> >I'm not sure what's the best way to resolve that. Perhaps
> >statext_mcv_clauselist_selectivity() / statext_is_compatible_clause()
> >should ignore OR clauses from an AND-list, on the basis that they will
> >get processed recursively later. Or perhaps estimatedclauses can
> >somehow be used to prevent this, though I'm not sure exactly how that
> >would work.
>
> I don't know. I feel uneasy about just ignoring some of the clauses,
> because what happens for complex clauses, where the OR is not directly
> in the AND clause, but is negated or something like that?
>
> Isn't it the case that clauselist_selectivity_simple (and the OR
> variant) should ignore extended stats entirely? That is, we'd need to
> add a flag (or _simple variant) to clause_selectivity, so that it calls
> causelist_selectivity_simple_or. So the calls would look like this:
>
> clauselist_selectivity
> statext_clauselist_selectivity
> statext_mcv_clauselist_selectivity
> clauselist_selectivity_simple <--- disable extended stats
> clause_selectivity
> clauselist_selectivity_simple_or
> clause_selectivity
> clause_selectivity
> mcv_clauselist_selectivity
> clauselist_selectivity_simple
> already estimated
>
> I've only quickly hacked clause_selectivity, but it does not seems very
> invasive (of course, it means disruption to clause_selectivity callers,
> but I suppose most call clauselist_selectivity).
>

Sounds like a reasonable approach, but I think it would be better to
preserve the current public API by having clauselist_selectivity()
become a thin wrapper around a new function that optionally applies
extended stats.

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Steele 2020-03-24 13:26:43 Re: Built-in connection pooler
Previous Message Dilip Kumar 2020-03-24 13:06:03 Re: Fastpath while arranging the changes in LSN order in logical decoding