Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2018-09-09 21:49:59
Message-ID: d17f081d-5cb8-3587-8fdc-30d216f7808e@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 09/04/2018 04:16 PM, Dean Rasheed wrote:
> On 3 September 2018 at 00:17, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Hi,
>>
>> Attached is an updated version of the patch series, adopting a couple of
>> improvements - both for MCV lists and histograms.
>>
>>
>> MCV
>> ---
>>
>> For the MCV list part, I've adopted the approach proposed by Dean, using
>> base selectivity and using it to correct the non-MCV part. I agree the
>> simplicity of the approach is a nice feature, and it seems to produce
>> better estimates. I'm not sure I understand the approach perfectly, but
>> I've tried to add comments explaining how it works etc.
>>
>
> Cool. Looking at this afresh after some time away, it still looks like
> a reasonable approach, and the test results are encouraging.
>
> In mcv_clauselist_selectivity(), you raise the following question:
>
> if (matches[i] != STATS_MATCH_NONE)
> {
> /* XXX Shouldn't the basesel be outside the if condition? */
> *basesel += mcv->items[i]->base_frequency;
> s += mcv->items[i]->frequency;
> }
>
> The reason it needs to be inside the "if" block is that what it's
> computing is the base (independent) selectivity of the clauses found
> to match the MCV stats, so that then in
> statext_clauselist_selectivity() it can be used in the following
> computation:
>
> /* Estimated selectivity of values not covered by MCV matches */
> other_sel = simple_sel - mcv_basesel;
>
> to give an estimate for the other clauses that aren't covered by the
> MCV stats. So I think the code is correct as it stands, but if you
> think it isn't clear enough, maybe a comment update is in order.
>
> The assumption being made is that mcv_basesel will cancel out the part
> of simple_sel that is due to clauses matching the MCV stats, so that
> we can then just add on the MCV selectivity. Of course that's only an
> approximation, and it won't be exact -- partly due to the fact that
> simple_sel and mcv_basesel are potentially computed using different
> sample rows, and so will differ in the MCV region, and partly because
> of second-order effects arising from the way that selectivities are
> combined in clauselist_selectivity_simple(). Maybe that's something
> that can be improved upon, but it doesn't seem like a bad initial
> approximation.
>

Thanks for the clarification. It's one of the comments I added while
reworking the patch, with still a very limited understanding of the
approach at that point in time. I'll replace it with a comment
explaining the reasoning in the next version.

>
>> I've also changed how we build the MCV lists, particularly how we decide
>> how many / which items store in the MCV list. In the previous version
>> I've adopted the same algorithm we use for per-column MCV lists, but in
>> certain cases that turned out to be too restrictive.
>>
>> Consider for example a table with multiple perfectly correlated columns,
>> with very few combinations. That is, something like this:
>>
>> CREATE TABLE t (a int, b int);
>>
>> INSERT INTO t SELECT mod(i,50), mod(i,50)
>> FROM generate_series(1,1e6) s(i);
>>
>> CREATE STATISTICS s (mcv) ON a,b FROM t;
>>
>> Now, the data distribution is very simple - uniform, with 50 distinct
>> combinations, each representing 2% of data (and the random sample should
>> be pretty close to that).
>>
>> In these cases, analyze_mcv_list decides it does not need any MCV list,
>> because the frequency for each value is pretty much 1/ndistinct. For
>> single column that's reasonable, but for multiple correlated columns
>> it's rather problematic. We might use the same ndistinct approach
>> (assuming we have the ndistinct coefficients), but that still does not
>> allow us to decide which combinations are "valid" with respect to the
>> data. For example we can't decide (1,10) does not appear in the data.
>>
>> So I'm not entirely sure adopting the same algorithm analyze_mcv_list
>> algorithm both for single-column and multi-column stats. It may make
>> sense to keep more items in the multi-column case for reasons that are
>> not really valid for a single single-column.
>>
>> For now I've added a trivial condition to simply keep all the groups
>> when possible. This probably needs more thought.
>>
>
> Ah, this is a good point. I think I see the problem here.
>
> analyze_mcv_list() works by keeping those MCV entries that are
> statistically significantly more frequent than the selectivity that
> would have otherwise been assigned to the values, which is based on
> ndistinct and nullfrac. That's not really right for multivariate stats
> though, because the selectivity that would be assigned to a
> multi-column value if it weren't in the multivariate MCV list is
> actually calculated using the product of individual column
> selectivities. Fortunately we now calculate this (base_frequency), so
> actually I think what's needed is a custom version of
> analyze_mcv_list() that keeps MCV entries if the observed frequency is
> statistically significantly larger than the base frequency, not the
> ndistinct-based frequency.
>

That's probably a good idea and should be an improvement in some cases.
But I think it kinda misses the second part, when a combination is much
less common than the simple product of selectivities (i.e. values that
are somehow "incompatible").

In the example above, there are only (a,b) combinations where (a == b),
so there's nothing like that. But in practice there will be some sort of
noise. So you may observe combinations where the actual frequency is
much lower than the base frequency.

I suppose "significantly larger than the base frequency" would not catch
that, so we need to also consider cases where it's significantly lower.

I also wonder if we could/should consider the multi-variate ndistinct
estimate, somehow. For the univariate case wen use the ndistinct to
estimate the average selectivity for non-MCV items. I think it'd be a
mistake not to leverage this knowledge (when ndistinct coefficients are
available), even if it only helps with simple equality clauses.

> It might also be worthwhile doing a little more work to make the
> base_frequency values more consistent with the way individual column
> selectivities are actually calculated -- currently the patch always
> uses the observed single-column frequencies to calculate the base
> frequencies, but actually the univariate stats would only do that for
> a subset of the single-column values, and the rest would get assigned
> a fixed share of the remaining selectivity-space. Factoring that into
> the base frequency calculation ought to give a better base frequency
> estimate (for use in mcv_clauselist_selectivity() and
> statext_clauselist_selectivity()), as well as give a more principled
> cutoff threshold for deciding which multivariate MCV values to keep.
> It may be possible to reuse some of the existing code for that.
>

I agree, but I think we can leave this for later.

> The initial goal of the base frequency calculation was to replicate
> the univariate stats computations, so that it can be used to give the
> right correction to be applied to the simple_sel value. If it can also
> be used to determine how many MCV entries to keep, that's an added
> bonus.
>

Yep.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Hironobu SUZUKI 2018-09-09 22:35:30 Re: pgbench - add pseudo-random permutation function
Previous Message Alexander Korotkov 2018-09-09 18:53:58 Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun