Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2018-09-04 14:16:28
Message-ID: CAEZATCUQSgWZSjN0G5kM8D+0zuAK6nkxLmRwSja3K1u1k_sM3Q@mail.gmail.com
Lists: pgsql-hackers

On 3 September 2018 at 00:17, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Hi,
>
> Attached is an updated version of the patch series, adopting a couple of
> improvements - both for MCV lists and histograms.
>
>
> MCV
> ---
>
> For the MCV list part, I've adopted the approach proposed by Dean, using
> base selectivity and using it to correct the non-MCV part. I agree the
> simplicity of the approach is a nice feature, and it seems to produce
> better estimates. I'm not sure I understand the approach perfectly, but
> I've tried to add comments explaining how it works etc.
>

Cool. Looking at this afresh after some time away, it still looks like
a reasonable approach, and the test results are encouraging.

In mcv_clauselist_selectivity(), you raise the following question:

    if (matches[i] != STATS_MATCH_NONE)
    {
        /* XXX Shouldn't the basesel be outside the if condition? */
        *basesel += mcv->items[i]->base_frequency;
        s += mcv->items[i]->frequency;
    }

The reason it needs to be inside the "if" block is that what it's
computing is the base (independent) selectivity of the clauses found
to match the MCV stats, so that then in
statext_clauselist_selectivity() it can be used in the following
computation:

    /* Estimated selectivity of values not covered by MCV matches */
    other_sel = simple_sel - mcv_basesel;

to give an estimate for the other clauses that aren't covered by the
MCV stats. So I think the code is correct as it stands, but if you
think it isn't clear enough, maybe a comment update is in order.

The assumption being made is that mcv_basesel will cancel out the part
of simple_sel that is due to clauses matching the MCV stats, so that
we can then just add on the MCV selectivity. Of course that's only an
approximation, and it won't be exact -- partly due to the fact that
simple_sel and mcv_basesel are potentially computed using different
sample rows, and so will differ in the MCV region, and partly because
of second-order effects arising from the way that selectivities are
combined in clauselist_selectivity_simple(). Maybe that's something
that can be improved upon, but it doesn't seem like a bad initial
approximation.
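
To spell out the arithmetic, the intended combination in
statext_clauselist_selectivity() is roughly the following (an
illustrative sketch only -- the variable names follow the discussion
above rather than the patch itself):

    /*
     * simple_sel   : clauselist_selectivity_simple() estimate, which
     *                assumes the columns are independent
     * mcv_sel      : summed frequency of the matching MCV items
     * mcv_basesel  : summed base (independent) frequency of those items
     */
    other_sel = simple_sel - mcv_basesel;   /* non-MCV part of simple_sel */
    CLAMP_PROBABILITY(other_sel);

    sel = mcv_sel + other_sel;              /* MCV part + everything else */
    CLAMP_PROBABILITY(sel);

So the better mcv_basesel tracks the MCV contribution to simple_sel,
the better the final estimate will be.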

> I've also changed how we build the MCV lists, particularly how we decide
> how many / which items to store in the MCV list. In the previous version
> I've adopted the same algorithm we use for per-column MCV lists, but in
> certain cases that turned out to be too restrictive.
>
> Consider for example a table with multiple perfectly correlated columns,
> with very few combinations. That is, something like this:
>
> CREATE TABLE t (a int, b int);
>
> INSERT INTO t SELECT mod(i,50), mod(i,50)
> FROM generate_series(1,1e6) s(i);
>
> CREATE STATISTICS s (mcv) ON a,b FROM t;
>
> Now, the data distribution is very simple - uniform, with 50 distinct
> combinations, each representing 2% of data (and the random sample should
> be pretty close to that).
>
> In these cases, analyze_mcv_list decides it does not need any MCV list,
> because the frequency for each value is pretty much 1/ndistinct. For a
> single column that's reasonable, but for multiple correlated columns
> it's rather problematic. We might use the same ndistinct approach
> (assuming we have the ndistinct coefficients), but that still does not
> allow us to decide which combinations are "valid" with respect to the
> data. For example, we can't tell that (1,10) does not appear in the data.
>
> So I'm not entirely sure about adopting the same analyze_mcv_list
> algorithm for both single-column and multi-column stats. It may make
> sense to keep more items in the multi-column case, for reasons that are
> not really valid for a single column.
>
> For now I've added a trivial condition to simply keep all the groups
> when possible. This probably needs more thought.
>

Ah, this is a good point. I think I see the problem here.

analyze_mcv_list() works by keeping those MCV entries that are
statistically significantly more frequent than the selectivity that
would have otherwise been assigned to the values, which is based on
ndistinct and nullfrac. That's not really right for multivariate stats
though, because the selectivity that would be assigned to a
multi-column value if it weren't in the multivariate MCV list is
actually calculated using the product of individual column
selectivities. Fortunately we now calculate this (base_frequency), so
actually I think what's needed is a custom version of
analyze_mcv_list() that keeps MCV entries if the observed frequency is
statistically significantly larger than the base frequency, not the
ndistinct-based frequency.
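
To sketch what such a custom version might look like (purely
illustrative -- the struct layout, the function name and the
two-standard-error bound are my assumptions, not code from the patch):

    #include <math.h>

    /* hypothetical item layout; the patch has its own MCVItem struct */
    typedef struct
    {
        double  frequency;       /* observed sample frequency */
        double  base_frequency;  /* product of per-column frequencies */
    } Item;

    /*
     * Keep an item only if its observed frequency exceeds its base
     * (independence) frequency by more than ~2 standard errors of the
     * observed frequency.  Items are assumed sorted by frequency,
     * as in analyze_mcv_list().
     */
    static int
    choose_nitems(const Item *items, int nitems, double samplerows)
    {
        int     keep = 0;

        for (int i = 0; i < nitems; i++)
        {
            double  freq = items[i].frequency;
            double  se = sqrt(freq * (1.0 - freq) / samplerows);

            if (freq > items[i].base_frequency + 2.0 * se)
                keep++;
            else
                break;
        }

        return keep;
    }

The exact statistical test is open to debate; the important difference
from analyze_mcv_list() is the reference point being the base frequency.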

It might also be worthwhile doing a little more work to make the
base_frequency values more consistent with the way individual column
selectivities are actually calculated -- currently the patch always
uses the observed single-column frequencies to calculate the base
frequencies, but actually the univariate stats would only do that for
a subset of the single-column values, and the rest would get assigned
a fixed share of the remaining selectivity-space. Factoring that into
the base frequency calculation ought to give a better base frequency
estimate (for use in mcv_clauselist_selectivity() and
statext_clauselist_selectivity()), as well as give a more principled
cutoff threshold for deciding which multivariate MCV values to keep.
It may be possible to reuse some of the existing code for that.
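
For illustration, the univariate logic that the base frequency
calculation would ideally mirror is roughly this (a sketch of what
var_eq_const() effectively does for a known constant; the names here
are made up):

    /*
     * Per-column frequency of one value: if the value is in the column's
     * MCV list we use its MCV frequency, otherwise the remaining
     * (non-MCV, non-null) selectivity is shared equally among the
     * remaining distinct values.
     */
    static double
    column_base_freq(double mcv_freq,       /* value's MCV frequency, or -1
                                             * if not in the MCV list */
                     double sum_mcv_freqs,  /* total frequency of MCV values */
                     double nullfrac,
                     double ndistinct,
                     int    num_mcv)
    {
        if (mcv_freq >= 0.0)
            return mcv_freq;

        return (1.0 - sum_mcv_freqs - nullfrac) / (ndistinct - num_mcv);
    }

The multi-column base_frequency would then be the product of these
per-column estimates, rather than the product of raw sample frequencies.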

The initial goal of the base frequency calculation was to replicate
the univariate stats computations, so that it can be used to give the
right correction to be applied to the simple_sel value. If it can also
be used to determine how many MCV entries to keep, that's an added
bonus.

Regards,
Dean
