|From:||Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>|
|Subject:||Re: multivariate statistics v8|
|Views:||Raw Message | Whole Thread | Download mbox|
attached is v8 of the multivariate statistics patch (or rather a patch
series). The patch currently has 7 parts, but 0001 is just a fix of the
pull_varnos issue (possibly incorrect/temporary), and 0007 is just an
attempt to add the "multicolumn distinctness" (experimental for now).
There are three noteworthy changes:
1) Correct estimation of OR-clauses - this turned out to be a rather
minor change, thanks to simply transforming the OR-clauses to
AND-clauses, see clauselist_selectivity_or() for details.
2) Abandoning the ALTER TABLE ... ADD STATISTICS syntax and instead
adding separate commands CREATE STATISTICS / DROP STATISTICS, as
proposed in the "multicolumn distinctness" thread:
This seems a better approach than the ALTER TABLE one - not only it
nicely fixes the grammar issues, it also naturally extends to
multi-table statistics (despite we don't know how those should work
The syntax is this:
CREATE STATISTICS name ON table (columns) WITH (options);
DROP STATISTICS name;
and the 'name' is optional (and if absent, should be generated just
like for indexes, but that's not implemented yet).
The remaining question is how unique the statistics name should be.
My initial plan was to make it unique within a table, but that of
course does not work well with the DROP STATISTICS (it'd have to
specify the table name also), and it'd also now work with statistics
on multiple tables (which is one of the reasons for abandoning ALTER
So I think it should be unique across tables. Statistics are hardly
a global object, so it should be unique within a schema. I thought
that simply using the schema of the table would work, but that of
course breaks with multiple tables in different schemas. So the only
solution seems to be explicit schema for statistics.
3) I've also started hacking on adding the "multicolumn distinctness"
proposed by Horiguchi-san, but I haven't really got that working. It
seems to be a bit more complicated than I anticipated because of the
"only equality conditions" restriction. So the 0007 patch only
really adds basic syntax and trivial build.
I do have bunch of ideas/questions about this statistics type. For
example, should we compute just a single coefficient or the exact
combination of columns specified in CREATE STATISTICS, or perhaps
for some additional subsets? I.e. with
CREATE STATISTICS ON t (a,b,c) WITH (ndistinct);
should we compute just the coefficient for (a,b,c), or maybe also
for (a,b), (b,c) and (a,c)? For N columns there's O(2^N) such
combinations, but perhaps it's acceptable.
Having the coefficient for just the single combination specified in
CREATE STATISTICS makes the estimation difficult when some of the
columns are not specified. For example, with coefficient just for
(a,b,c), what should happen for (WHERE a=1 AND b=2)?
Should we simply ignore the statistics, or apply it anyway and
somehow compensate for the missing columns?
I've also started working on something like a paper, hopefully
explaining the ideas and implementation more clearly and consistently
than possible on a mailing list (thanks to charts, figures and such).
It's available here (both the .tex source and .pdf with the current
It's not exactly short (~30 pages), and it's certainly incomplete with a
plenty of TODO notes, but hopefully it's already useful and not entirely
Comments and questions are welcome - both to the patch and paper.
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
|Next Message||Robert Haas||2015-12-23 19:14:35||Re: parallel joins, and better parallel explain|
|Previous Message||Joshua D. Drake||2015-12-23 18:55:39||Re: Parallel pg_dump's error reporting doesn't work worth squat|