Quick Links

Re: WIP: multivariate statistics / proof of concept

From:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: WIP: multivariate statistics / proof of concept
Date:	2015-03-31 00:26:26
Message-ID:	5519E9B2.5080105@2ndquadrant.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hello,

attached is a new version of the patch series. Aside from fixing various
issues (crashes, memory leaks). The patches are rebased to current
master, and I also attach a few SQL scripts I used for testing (nothing
fancy, just stress-testing all the parts the patch touches).

The main changes in the patches (requiring plenty of changes in the
other parts) are about these:

(1) combining multiple statistics on a table
--------------------------------------------

In the previous version of the patch, it was only possible to use a
single statistics on a table - when there was a statistics "covering"
all the conditions it worked fine, but that's not always the case.

The new patch is able to combine multiple statistics by decomposing the
probability (=selectivity) into conditional probabilities. Imagine
estimating selectivity of clauses

WHERE (a=1) AND (b=1) AND (c=1) AND (d=1)

with statistics on [a,b,c] and [b,c,d]. The selectivity may be split for
example like this:

P(a=1,b=1,c=1,d=1) = P(a=1,b=1,c=1) * P(d=1|a=1,b=1,c=1)

where P(a=1,b=1,c=1) may be estimated using statistics [a,b,c], and the
second may be simplified like this:

P(d=1|a=1,b=1,c=1) = P(d=1|b=1,c=1)

using the assumption "no multivariate stats => independent". Both these
probabilities match the existing statistics.

The idea is described a bit more in the part #5 of the patch.

(2) choosing the best combination of statistics
-----------------------------------------------

There may be more statistics on a table, and multiple possible ways to
use them to estimate the clauses (different ordering, overlapping
statistics, etc.).

The patch formulates this as an optimization task with two goals.

(a) cover as many clauses as possible
(b) reuse as many conditions (i.e. dependencies) as possible

and implements two algorithms to solve this: (a) exhaustive, walking
through all possible states (using dynamic programming), and (b) greedy,
choosing the best local solution in each step.

The time requirements for the exhaustive solution grows pretty quickly
with the number of clauses and statistics on a table (~ O(N!)). The
greedy is much faster, as it's ~O(N) and in fact much more time is spent
in actually processing the selected statistics (walking through the
histograms etc.).

I assume the exhaustive search may find a better solution in some cases
(that the greedy algorithm misses), but so far I've been unable to come
up with such example.

To make this easier to test, I've added GUC to switch between these
algorithms easily (set to 'greedy' by default)

mvstat_search = {'greedy', 'exhaustive'}

I assume this GUC will be removed eventually, after we figure out which
algorithm is the right one.

(3) estimation of more complex conditions (AND/OR clauses)
----------------------------------------------------------

I've added ability to estimate more complex clauses - combinations of
AND/OR clauses and such. It's somewhat incomplete at the moment, but
hopefully the ideas will be clear from the TODOs/FIXMEs along the way.

Let me know if you have any questions about this version of the patch,
or about the ideas it implements in general.

I also welcome real-world examples of poorly estimated queries, so that
I can test if these patches improve that particular case situation.

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment	Content-Type	Size
0001-shared-infrastructure-and-functional-dependencies.patch	text/x-diff	65.9 KB
0002-clause-reduction-using-functional-dependencies.patch	text/x-diff	44.5 KB
0003-multivariate-MCV-lists.patch	text/x-diff	109.4 KB
0004-multivariate-histograms.patch	text/x-diff	117.4 KB
0005-multi-statistics-estimation.patch	text/x-diff	71.2 KB
mcv.sql	application/sql	25.8 KB
histogram.sql	application/sql	34.8 KB
dependencies.sql	application/sql	19.0 KB

In response to

Re: WIP: multivariate statistics / proof of concept at 2015-03-24 05:34:27 from Kyotaro HORIGUCHI

Responses

Re: WIP: multivariate statistics / proof of concept at 2015-04-28 16:09:39 from Jeff Janes

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Peter Eisentraut	2015-03-31 00:40:14	Re: Exposing PG_VERSION_NUM in pg_config
Previous Message	Mark Kirkwood	2015-03-31 00:14:45	Re: Streaming replication