Re: multivariate statistics v10

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: multivariate statistics v10
Date: 2016-03-02 14:56:28
Message-ID: 56D6FF1C.5060303@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Attached is v10 of the patch series. There are 9 parts at the moment:

0001-teach-pull_-varno-varattno-_walker-about-RestrictInf.patch
0002-shared-infrastructure-and-functional-dependencies.patch
0003-clause-reduction-using-functional-dependencies.patch
0004-multivariate-MCV-lists.patch
0005-multivariate-histograms.patch
0006-multi-statistics-estimation.patch
0007-multivariate-ndistinct-coefficients.patch
0008-change-how-we-apply-selectivity-to-number-of-groups-.patch
0009-fixup-of-regression-tests-plans-changes-by-group-by-.patch

However, the first one is still just a temporary workaround that I plan
to address next, and the last 3 are all dealing with the ndistinct
coefficients (and shall be squashed into a single chunk).

README docs
-----------

Aside from fixing a few bugs, there are several major improvements, the
main one being that I've moved most of the comments explaining how it
all works into a set of regular README files, located in
src/backend/utils/mvstats:

1) README.stats - Overview of available types of statistics, what
clauses can be estimated, how multiple statistics are combined etc.
This is probably the right place to start.

2) docs for each type of statistics currently available

README.dependencies - soft functional dependencies
README.mcv - MCV lists
README.histogram - histograms
README.ndistinct - ndistinct coefficients

The READMEs are added and modified through the patch series, so the best
thing to do is apply all the patches and start reading.

I have not improved the user-oriented SGML documentation in this patch,
that's one of the tasks I'd lie to work on next. But the READMEs should
give you a good idea how it's supposed to work, and there are some
examples of use in the regression tests.

Significantly simplified places
-------------------------------

The patch version also significantly simplifies several places that were
needlessly complex in the previous ones - firstly the function
evaluating clauses on multivariate histograms was rather needlessly
bloated, so I've simplified it a lot. Similarly for the code in
clauselist_select() that combines multiple statistics to estimate a list
of clauses - that's much simpler now too. And various other pieces.

That being said, I still think the code in clausesel.c can be
simplified. I feel there's a lot of cruft, mostly due to unknowingly
implementing something that could be solved by an existing function.

A prime example of that is inspecting the expression tree to check if we
know how to estimate the clauses using the multivariate statistics. That
sounds like a nice match for expression walker, but currently is done by
custom code. I plan to look at that next.

Also, I'm not quite sure I understand what the varRelid parameter of
clauselist_selectivity is for, so the code may be handling that wrong
(seems to be working though).

ndistinct coefficients
----------------------

The one new piece in this patch is the GROUP BY estimation, based on the
ndistinct coefficients. So for example you can do this:

CREATE TABLE t AS SELECT mod(i,1000) AS a, mod(i,1000) AS b
FROM generate_series(1,1000000) s(i);
ANALYZE t;
EXPLAIN SELECT * FROM t GROUP BY a, b;

which currently does this:

QUERY PLAN
-----------------------------------------------------------------------
Group (cost=127757.34..135257.34 rows=99996 width=8)
Group Key: a, b
-> Sort (cost=127757.34..130257.34 rows=1000000 width=8)
Sort Key: a, b
-> Seq Scan on t (cost=0.00..14425.00 rows=1000000 width=8)
(5 rows)

but we know that there are only 1000 groups because the columns are
correlated. So let's create ndistinct statistics on the two columns:

CREATE STATISTICS s1 ON t (a,b) WITH (ndistinct);
ANALYZE t;

which results in estimates like this:

QUERY PLAN
-----------------------------------------------------------------
HashAggregate (cost=19425.00..19435.00 rows=1000 width=8)
Group Key: a, b
-> Seq Scan on t (cost=0.00..14425.00 rows=1000000 width=8)
(3 rows)

I'm not quite sure how to combine this type of statistics with MCV lists
and histograms, so for now it's used only for GROUP BY.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-teach-pull_-varno-varattno-_walker-about-RestrictInf.patch binary/octet-stream 1.4 KB
0002-shared-infrastructure-and-functional-dependencies.patch binary/octet-stream 106.3 KB
0003-clause-reduction-using-functional-dependencies.patch binary/octet-stream 45.8 KB
0004-multivariate-MCV-lists.patch binary/octet-stream 119.0 KB
0005-multivariate-histograms.patch binary/octet-stream 138.3 KB
0006-multi-statistics-estimation.patch binary/octet-stream 95.9 KB
0007-multivariate-ndistinct-coefficients.patch binary/octet-stream 28.5 KB
0008-change-how-we-apply-selectivity-to-number-of-groups-.patch binary/octet-stream 1.4 KB
0009-fixup-of-regression-tests-plans-changes-by-group-by-.patch binary/octet-stream 6.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2016-03-02 15:18:48 Re: pg_dump / copy bugs with "big lines" ?
Previous Message David Steele 2016-03-02 14:54:58 2016-03 Commitfest