Re: WIP: multivariate statistics / proof of concept

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP: multivariate statistics / proof of concept
Date: 2015-03-31 00:26:26
Message-ID: 5519E9B2.5080105@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

attached is a new version of the patch series. Aside from fixing various
issues (crashes, memory leaks). The patches are rebased to current
master, and I also attach a few SQL scripts I used for testing (nothing
fancy, just stress-testing all the parts the patch touches).

The main changes in the patches (requiring plenty of changes in the
other parts) are about these:

(1) combining multiple statistics on a table
--------------------------------------------

In the previous version of the patch, it was only possible to use a
single statistics on a table - when there was a statistics "covering"
all the conditions it worked fine, but that's not always the case.

The new patch is able to combine multiple statistics by decomposing the
probability (=selectivity) into conditional probabilities. Imagine
estimating selectivity of clauses

WHERE (a=1) AND (b=1) AND (c=1) AND (d=1)

with statistics on [a,b,c] and [b,c,d]. The selectivity may be split for
example like this:

P(a=1,b=1,c=1,d=1) = P(a=1,b=1,c=1) * P(d=1|a=1,b=1,c=1)

where P(a=1,b=1,c=1) may be estimated using statistics [a,b,c], and the
second may be simplified like this:

P(d=1|a=1,b=1,c=1) = P(d=1|b=1,c=1)

using the assumption "no multivariate stats => independent". Both these
probabilities match the existing statistics.

The idea is described a bit more in the part #5 of the patch.

(2) choosing the best combination of statistics
-----------------------------------------------

There may be more statistics on a table, and multiple possible ways to
use them to estimate the clauses (different ordering, overlapping
statistics, etc.).

The patch formulates this as an optimization task with two goals.

(a) cover as many clauses as possible
(b) reuse as many conditions (i.e. dependencies) as possible

and implements two algorithms to solve this: (a) exhaustive, walking
through all possible states (using dynamic programming), and (b) greedy,
choosing the best local solution in each step.

The time requirements for the exhaustive solution grows pretty quickly
with the number of clauses and statistics on a table (~ O(N!)). The
greedy is much faster, as it's ~O(N) and in fact much more time is spent
in actually processing the selected statistics (walking through the
histograms etc.).

I assume the exhaustive search may find a better solution in some cases
(that the greedy algorithm misses), but so far I've been unable to come
up with such example.

To make this easier to test, I've added GUC to switch between these
algorithms easily (set to 'greedy' by default)

mvstat_search = {'greedy', 'exhaustive'}

I assume this GUC will be removed eventually, after we figure out which
algorithm is the right one.

(3) estimation of more complex conditions (AND/OR clauses)
----------------------------------------------------------

I've added ability to estimate more complex clauses - combinations of
AND/OR clauses and such. It's somewhat incomplete at the moment, but
hopefully the ideas will be clear from the TODOs/FIXMEs along the way.

Let me know if you have any questions about this version of the patch,
or about the ideas it implements in general.

I also welcome real-world examples of poorly estimated queries, so that
I can test if these patches improve that particular case situation.

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-shared-infrastructure-and-functional-dependencies.patch text/x-diff 65.9 KB
0002-clause-reduction-using-functional-dependencies.patch text/x-diff 44.5 KB
0003-multivariate-MCV-lists.patch text/x-diff 109.4 KB
0004-multivariate-histograms.patch text/x-diff 117.4 KB
0005-multi-statistics-estimation.patch text/x-diff 71.2 KB
mcv.sql application/sql 25.8 KB
histogram.sql application/sql 34.8 KB
dependencies.sql application/sql 19.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2015-03-31 00:40:14 Re: Exposing PG_VERSION_NUM in pg_config
Previous Message Mark Kirkwood 2015-03-31 00:14:45 Re: Streaming replication