Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2018-09-02 23:17:15
Message-ID: 8ac8bd94-478d-215d-e6bd-339f1f20a74c@2ndquadrant.com
Lists: pgsql-hackers

Hi,

Attached is an updated version of the patch series, adopting a couple of
improvements - both for MCV lists and histograms.

MCV
---

For the MCV list part, I've adopted the approach proposed by Dean,
computing a base selectivity and using it to correct the non-MCV part. I
agree the simplicity of the approach is a nice feature, and it seems to
produce better estimates. I'm not sure I understand the approach
perfectly yet, but I've tried to add comments explaining how it works.

I've also changed how we build the MCV lists, particularly how we decide
how many (and which) items to store in the MCV list. In the previous
version I adopted the same algorithm we use for per-column MCV lists,
but in certain cases that turned out to be too restrictive.

Consider for example a table with multiple perfectly correlated columns,
with very few combinations. That is, something like this:

CREATE TABLE t (a int, b int);

INSERT INTO t SELECT mod(i,50), mod(i,50)
FROM generate_series(1,1e6) s(i);

CREATE STATISTICS s (mcv) ON a,b FROM t;

Now, the data distribution is very simple - uniform, with 50 distinct
combinations, each representing 2% of data (and the random sample should
be pretty close to that).
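For illustration, the same distribution can be verified outside the
database with a few lines of Python (mirroring the mod(i,50) generator
from the example above):

```python
from collections import Counter

# Reproduce the example's generator: mod(i, 50) for i in 1..1e6.
counts = Counter(i % 50 for i in range(1, 1000001))

print(len(counts))          # -> 50 distinct combinations
print(counts[0] / 1000000)  # -> 0.02, i.e. each combination is 2% of rows
```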

In these cases, analyze_mcv_list decides it does not need any MCV list,
because the frequency of each value is pretty much 1/ndistinct. For a
single column that's reasonable, but for multiple correlated columns
it's rather problematic. We might use the same ndistinct approach
(assuming we have the ndistinct coefficients), but that still does not
allow us to decide which combinations are "valid" with respect to the
data. For example, we can't decide that (1,10) does not appear in the
data.
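To illustrate the problem, here is a simplified sketch (not the actual
analyze_mcv_list code, and the threshold is made up): keeping only items
whose sample frequency clearly exceeds the 1/ndistinct average drops
every item in the perfectly uniform case, so the MCV list ends up empty
and we lose the information about which combinations exist at all:

```python
# Simplified sketch of the per-column heuristic (hypothetical
# threshold, not the exact analyze_mcv_list logic): keep an item
# only when its frequency is clearly above the 1/ndistinct average.
def filter_mcv(freqs, ndistinct, threshold=1.25):
    avg = 1.0 / ndistinct
    return [f for f in freqs if f > avg * threshold]

# 50 perfectly uniform combinations, each 2% of the rows: every
# frequency equals the average, so nothing survives the filter.
print(filter_mcv([0.02] * 50, 50))  # -> []
```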

So I'm not entirely sure adopting the same analyze_mcv_list algorithm
for both single-column and multi-column stats is the right thing to do.
It may make sense to keep more items in the multi-column case, for
reasons that are not really valid for a single column.

For now I've added a trivial condition to simply keep all the groups
when possible. This probably needs more thought.

BTW Dean's patch also modified how the maximum number of items in a MCV
list is determined - instead of the shaky defaults I used before, it
derives the size from the attstattarget values of the columns, taking
the maximum value. That seems to make sense, so I've kept this.

histograms
----------

For histograms, I've made the two improvements I mentioned previously.

Firstly, simple equality conditions (of the form "var = const") are
estimated as 1/ndistinct (possibly using ndistinct coefficients when
available), and then used only as "conditions" (in the "conditional
probability" sense) when estimating the remaining clauses using the
histogram.

That is, P(clauses) is split into two parts

P(clauses) = P(equalities) * P(remaining|equalities)

where the first part is estimated as 1/ndistinct, and the second part is
estimated using the histogram.

I'm sure this needs more thought, particularly when combining MCV and
histogram estimates. But in general it seems to work quite nicely.
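As a toy illustration of the split (the numbers here are hypothetical,
not taken from the patch): for a query like WHERE a = 1 AND b < 10, with
ndistinct(a) = 50 and the histogram suggesting 30% of the a = 1 rows
satisfy b < 10, the estimate would combine as follows:

```python
# Toy illustration of P(clauses) = P(equalities) * P(remaining|equalities)
# with hypothetical numbers (ndistinct = 50, histogram fraction = 0.30).
p_equalities = 1.0 / 50       # "a = 1" estimated as 1/ndistinct
p_remaining_given_eq = 0.30   # "b < 10" from the histogram, conditioned
                              # on the rows matching the equality clauses
p_clauses = p_equalities * p_remaining_given_eq

print(p_clauses)  # ~0.006, i.e. 0.6% of the rows
```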

The second improvement is about estimating what fraction of a bucket
matches the conditions. Instead of using the rough 1/2-bucket estimate,
I've adopted the convert_to_scalar approach, computing a geometric mean
for all the clauses (at a bucket level).

I'm not entirely sure the geometric mean is the right approach (or
better than simply using 1/2 of the bucket), because multiplying the
per-clause frequencies is mostly equivalent to assuming independence at
the bucket level. That is rather incompatible with the purpose of
multi-column statistics, which are meant to be used exactly when the
columns are not independent.
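A minimal sketch of the per-bucket combination (an assumed shape for
illustration, not the patch's actual code): each clause yields the
fraction of the bucket it matches, and the fractions are combined with a
geometric mean rather than the flat 1/2-bucket guess:

```python
import math

# Combine per-clause bucket fractions with a geometric mean
# (illustrative sketch; the patch's convert_to_scalar-based code
# computes the per-clause fractions from the bucket boundaries).
def bucket_selectivity(clause_fractions):
    return math.exp(sum(math.log(f) for f in clause_fractions)
                    / len(clause_fractions))

# Two clauses matching 40% and 90% of a bucket:
# geometric mean = sqrt(0.4 * 0.9) = 0.6
print(bucket_selectivity([0.4, 0.9]))
```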

measurements
------------

I think we need to maintain a set of tests (data set + query), so that
we can compare the impact of various changes to the algorithm. So far
we've used mostly ad-hoc queries, often created as counter-examples, and
that does not seem very practical.

So I'm attaching a simple SQL script that I consider an initial version
of that. It has a couple of synthetic data sets, and queries estimated
with and without extended statistics.

I'm also attaching a spreadsheet with results for (a) the original
version of the patch series, as submitted on 6/24, (b) the new version
attached here and (c) the new version using the per-bucket estimates
directly, without the geometric mean.

Overall, the new versions seem to perform better than the version from
6/24, and also better than per-column statistics alone. There are cases
where extended statistics produce over-estimates, but I find that
somewhat natural, due to the lower resolution of the multi-column stats.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
stats.ods application/vnd.oasis.opendocument.spreadsheet 30.1 KB
stats-tests.sql application/sql 17.7 KB
0002-multivariate-histograms-20180902.patch text/x-patch 175.1 KB
0001-multivariate-MCV-lists-20180902.patch text/x-patch 148.0 KB
