From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|

To: | pgsql-hackers(at)postgresql(dot)org |

Subject: | Re: multivariate statistics v8 |

Date: | 2015-12-23 19:07:48 |

Message-ID: | 567AF104.80700@2ndquadrant.com |

Views: | Raw Message | Whole Thread | Download mbox |

Thread: | |

Lists: | pgsql-hackers |

Hi,

attached is v8 of the multivariate statistics patch (or rather a patch

series). The patch currently has 7 parts, but 0001 is just a fix of the

pull_varnos issue (possibly incorrect/temporary), and 0007 is just an

attempt to add the "multicolumn distinctness" (experimental for now).

There are three noteworthy changes:

1) Correct estimation of OR-clauses - this turned out to be a rather

minor change, thanks to simply transforming the OR-clauses to

AND-clauses, see clauselist_selectivity_or() for details.

2) Abandoning the ALTER TABLE ... ADD STATISTICS syntax and instead

adding separate commands CREATE STATISTICS / DROP STATISTICS, as

proposed in the "multicolumn distinctness" thread:

http://www.postgresql.org/message-id/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp

This seems a better approach than the ALTER TABLE one - not only it

nicely fixes the grammar issues, it also naturally extends to

multi-table statistics (despite we don't know how those should work

exactly).

The syntax is this:

CREATE STATISTICS name ON table (columns) WITH (options);

DROP STATISTICS name;

and the 'name' is optional (and if absent, should be generated just

like for indexes, but that's not implemented yet).

The remaining question is how unique the statistics name should be.

My initial plan was to make it unique within a table, but that of

course does not work well with the DROP STATISTICS (it'd have to

specify the table name also), and it'd also now work with statistics

on multiple tables (which is one of the reasons for abandoning ALTER

TABLE stuff).

So I think it should be unique across tables. Statistics are hardly

a global object, so it should be unique within a schema. I thought

that simply using the schema of the table would work, but that of

course breaks with multiple tables in different schemas. So the only

solution seems to be explicit schema for statistics.

3) I've also started hacking on adding the "multicolumn distinctness"

proposed by Horiguchi-san, but I haven't really got that working. It

seems to be a bit more complicated than I anticipated because of the

"only equality conditions" restriction. So the 0007 patch only

really adds basic syntax and trivial build.

I do have bunch of ideas/questions about this statistics type. For

example, should we compute just a single coefficient or the exact

combination of columns specified in CREATE STATISTICS, or perhaps

for some additional subsets? I.e. with

CREATE STATISTICS ON t (a,b,c) WITH (ndistinct);

should we compute just the coefficient for (a,b,c), or maybe also

for (a,b), (b,c) and (a,c)? For N columns there's O(2^N) such

combinations, but perhaps it's acceptable.

Having the coefficient for just the single combination specified in

CREATE STATISTICS makes the estimation difficult when some of the

columns are not specified. For example, with coefficient just for

(a,b,c), what should happen for (WHERE a=1 AND b=2)?

Should we simply ignore the statistics, or apply it anyway and

somehow compensate for the missing columns?

I've also started working on something like a paper, hopefully

explaining the ideas and implementation more clearly and consistently

than possible on a mailing list (thanks to charts, figures and such).

It's available here (both the .tex source and .pdf with the current

version):

https://bitbucket.org/tvondra/mvstats-paper/src

It's not exactly short (~30 pages), and it's certainly incomplete with a

plenty of TODO notes, but hopefully it's already useful and not entirely

bogus.

Comments and questions are welcome - both to the patch and paper.

regards

--

Tomas Vondra http://www.2ndQuadrant.com

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment | Content-Type | Size |
---|---|---|

0001-teach-pull_-varno-varattno-_walker-about-RestrictInf.patch | text/x-diff | 1.4 KB |

0002-shared-infrastructure-and-functional-dependencies.patch | text/x-diff | 96.3 KB |

0003-clause-reduction-using-functional-dependencies.patch | text/x-diff | 48.1 KB |

0004-multivariate-MCV-lists.patch | text/x-diff | 115.3 KB |

0005-multivariate-histograms.patch | text/x-diff | 140.6 KB |

0006-multi-statistics-estimation.patch | text/x-diff | 101.6 KB |

0007-initial-version-of-ndistinct-conefficient-statistics.patch | text/x-diff | 17.7 KB |

- WIP: multivariate statistics / proof of concept at 2014-10-12 22:00:53 from Tomas Vondra

- Re: multivariate statistics v9 at 2016-01-19 04:24:07 from Tomas Vondra
- Re: multivariate statistics v8 at 2016-01-20 19:20:38 from Robert Haas

From | Date | Subject | |
---|---|---|---|

Next Message | Robert Haas | 2015-12-23 19:14:35 | Re: parallel joins, and better parallel explain |

Previous Message | Joshua D. Drake | 2015-12-23 18:55:39 | Re: Parallel pg_dump's error reporting doesn't work worth squat |