From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|

To: | Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> |

Cc: | pgsql-hackers(at)postgresql(dot)org |

Subject: | Re: WIP: multivariate statistics / proof of concept |

Date: | 2015-03-31 00:26:26 |

Message-ID: | 5519E9B2.5080105@2ndquadrant.com |

Views: | Raw Message | Whole Thread | Download mbox |

Thread: | |

Lists: | pgsql-hackers |

Hello,

attached is a new version of the patch series. Aside from fixing various

issues (crashes, memory leaks). The patches are rebased to current

master, and I also attach a few SQL scripts I used for testing (nothing

fancy, just stress-testing all the parts the patch touches).

The main changes in the patches (requiring plenty of changes in the

other parts) are about these:

(1) combining multiple statistics on a table

--------------------------------------------

In the previous version of the patch, it was only possible to use a

single statistics on a table - when there was a statistics "covering"

all the conditions it worked fine, but that's not always the case.

The new patch is able to combine multiple statistics by decomposing the

probability (=selectivity) into conditional probabilities. Imagine

estimating selectivity of clauses

WHERE (a=1) AND (b=1) AND (c=1) AND (d=1)

with statistics on [a,b,c] and [b,c,d]. The selectivity may be split for

example like this:

P(a=1,b=1,c=1,d=1) = P(a=1,b=1,c=1) * P(d=1|a=1,b=1,c=1)

where P(a=1,b=1,c=1) may be estimated using statistics [a,b,c], and the

second may be simplified like this:

P(d=1|a=1,b=1,c=1) = P(d=1|b=1,c=1)

using the assumption "no multivariate stats => independent". Both these

probabilities match the existing statistics.

The idea is described a bit more in the part #5 of the patch.

(2) choosing the best combination of statistics

-----------------------------------------------

There may be more statistics on a table, and multiple possible ways to

use them to estimate the clauses (different ordering, overlapping

statistics, etc.).

The patch formulates this as an optimization task with two goals.

(a) cover as many clauses as possible

(b) reuse as many conditions (i.e. dependencies) as possible

and implements two algorithms to solve this: (a) exhaustive, walking

through all possible states (using dynamic programming), and (b) greedy,

choosing the best local solution in each step.

The time requirements for the exhaustive solution grows pretty quickly

with the number of clauses and statistics on a table (~ O(N!)). The

greedy is much faster, as it's ~O(N) and in fact much more time is spent

in actually processing the selected statistics (walking through the

histograms etc.).

I assume the exhaustive search may find a better solution in some cases

(that the greedy algorithm misses), but so far I've been unable to come

up with such example.

To make this easier to test, I've added GUC to switch between these

algorithms easily (set to 'greedy' by default)

mvstat_search = {'greedy', 'exhaustive'}

I assume this GUC will be removed eventually, after we figure out which

algorithm is the right one.

(3) estimation of more complex conditions (AND/OR clauses)

----------------------------------------------------------

I've added ability to estimate more complex clauses - combinations of

AND/OR clauses and such. It's somewhat incomplete at the moment, but

hopefully the ideas will be clear from the TODOs/FIXMEs along the way.

Let me know if you have any questions about this version of the patch,

or about the ideas it implements in general.

I also welcome real-world examples of poorly estimated queries, so that

I can test if these patches improve that particular case situation.

regards

--

Tomas Vondra http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment | Content-Type | Size |
---|---|---|

0001-shared-infrastructure-and-functional-dependencies.patch | text/x-diff | 65.9 KB |

0002-clause-reduction-using-functional-dependencies.patch | text/x-diff | 44.5 KB |

0003-multivariate-MCV-lists.patch | text/x-diff | 109.4 KB |

0004-multivariate-histograms.patch | text/x-diff | 117.4 KB |

0005-multi-statistics-estimation.patch | text/x-diff | 71.2 KB |

mcv.sql | application/sql | 25.8 KB |

histogram.sql | application/sql | 34.8 KB |

dependencies.sql | application/sql | 19.0 KB |

- Re: WIP: multivariate statistics / proof of concept at 2015-03-24 05:34:27 from Kyotaro HORIGUCHI

- Re: WIP: multivariate statistics / proof of concept at 2015-04-28 16:09:39 from Jeff Janes

From | Date | Subject | |
---|---|---|---|

Next Message | Peter Eisentraut | 2015-03-31 00:40:14 | Re: Exposing PG_VERSION_NUM in pg_config |

Previous Message | Mark Kirkwood | 2015-03-31 00:14:45 | Re: Streaming replication |