Re: Additional improvements to extended statistics

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Additional improvements to extended statistics
Date: 2020-03-24 14:33:06
Message-ID: 20200324143306.rz5siohm4jsoveqm@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 24, 2020 at 01:20:07PM +0000, Dean Rasheed wrote:
>On Tue, 24 Mar 2020 at 01:08, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> Hmmm. So let's consider a simple OR clause with two arguments, both
>> covered by single statistics object. Something like this:
>>
>> CREATE TABLE t (a int, b int);
>> INSERT INTO t SELECT mod(i, 10), mod(i, 10)
>> FROM generate_series(1,100000);
>> CREATE STATISTICS s (mcv) ON a,b FROM t;
>>
>> SELECT * FROM t WHERE a = 0 OR b = 0;
>>
>> Which is estimated correctly...
>>
>
>Hmm, the reason that is estimated correctly is that the MCV values
>cover all values in the table, so mcv_totalsel is 1 (or pretty close
>to 1), and other_sel is capped to nearly 0, and so the result is
>basically just mcv_sel -- i.e., in this case the MCV estimates are
>returned more-or-less as-is, rather than being applied as a
>correction, and so the result is independent of how many times
>extended stats are applied.
>
>The more interesting (and probably more realistic) case is where the
>MCV values do not cover the all values in the table, as in the new
>mcv_lists_partial examples in the regression tests, for example this
>test case, which produces a significant overestimate:
>
> SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0
>
>although actually, I think there's another reason for that (in
>addition to the extended stats correction being applied twice) -- the
>way the MCV code makes use of base selectivity doesn't seem really
>appropriate for OR clauses, because the way base_frequency is computed
>is really based on the assumption that every column would be matched.
>I'm not sure that there's any easy answer to that though. I feels like
>what's needed when computing the match bitmaps in mcv.c, is to produce
>a bitmap (would it fit in 1 byte?) per value, to indicate which
>columns match, and base_frequency values per column. That would be
>significantly more work though, so almost certainly isn't feasible for
>PG13.
>

Good point. I haven't thought about the base frequencies. I think 1 byte
should be enough, as we limit the number of columns to 8.

>> IIUC the problem you have in mind is that we end up calling
>> statext_mcv_clauselist_selectivity twice, once for the top-level AND
>> clause with a single argument, and then recursively for the expanded OR
>> clause. Indeed, that seems to be applying the correction twice.
>>
>>
>> >I'm not sure what's the best way to resolve that. Perhaps
>> >statext_mcv_clauselist_selectivity() / statext_is_compatible_clause()
>> >should ignore OR clauses from an AND-list, on the basis that they will
>> >get processed recursively later. Or perhaps estimatedclauses can
>> >somehow be used to prevent this, though I'm not sure exactly how that
>> >would work.
>>
>> I don't know. I feel uneasy about just ignoring some of the clauses,
>> because what happens for complex clauses, where the OR is not directly
>> in the AND clause, but is negated or something like that?
>>
>> Isn't it the case that clauselist_selectivity_simple (and the OR
>> variant) should ignore extended stats entirely? That is, we'd need to
>> add a flag (or _simple variant) to clause_selectivity, so that it calls
>> causelist_selectivity_simple_or. So the calls would look like this:
>>
>> clauselist_selectivity
>> statext_clauselist_selectivity
>> statext_mcv_clauselist_selectivity
>> clauselist_selectivity_simple <--- disable extended stats
>> clause_selectivity
>> clauselist_selectivity_simple_or
>> clause_selectivity
>> clause_selectivity
>> mcv_clauselist_selectivity
>> clauselist_selectivity_simple
>> already estimated
>>
>> I've only quickly hacked clause_selectivity, but it does not seems very
>> invasive (of course, it means disruption to clause_selectivity callers,
>> but I suppose most call clauselist_selectivity).
>>
>
>Sounds like a reasonable approach, but I think it would be better to
>preserve the current public API by having clauselist_selectivity()
>become a thin wrapper around a new function that optionally applies
>extended stats.
>

OK, makes sense. I'll take a stab at it.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Juan José Santamaría Flecha 2020-03-24 14:34:34 Re: color by default
Previous Message Tom Lane 2020-03-24 14:31:40 Re: documentation pdf build fail (HEAD)