Re: multivariate statistics v10

From: Thom Brown <thom(at)linux(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics v10
Date: 2016-03-02 16:17:17
Message-ID: CAA-aLv5cOaotb6-=mOEjVvm3Afh9638Ao+4UeczyZZkuQfHF8g@mail.gmail.com
Lists: pgsql-hackers

On 2 March 2016 at 14:56, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> Hi,
>
> Attached is v10 of the patch series. There are 9 parts at the moment:
>
> 0001-teach-pull_-varno-varattno-_walker-about-RestrictInf.patch
> 0002-shared-infrastructure-and-functional-dependencies.patch
> 0003-clause-reduction-using-functional-dependencies.patch
> 0004-multivariate-MCV-lists.patch
> 0005-multivariate-histograms.patch
> 0006-multi-statistics-estimation.patch
> 0007-multivariate-ndistinct-coefficients.patch
> 0008-change-how-we-apply-selectivity-to-number-of-groups-.patch
> 0009-fixup-of-regression-tests-plans-changes-by-group-by-.patch
>
> However, the first one is still just a temporary workaround that I plan to address next, and the last 3 are all dealing with the ndistinct coefficients (and shall be squashed into a single chunk).
>
>
> README docs
> -----------
>
> Aside from fixing a few bugs, there are several major improvements, the main one being that I've moved most of the comments explaining how it all works into a set of regular README files, located in src/backend/utils/mvstats:
>
> 1) README.stats - Overview of available types of statistics, what
> clauses can be estimated, how multiple statistics are combined etc.
> This is probably the right place to start.
>
> 2) docs for each type of statistics currently available
>
> README.dependencies - soft functional dependencies
> README.mcv - MCV lists
> README.histogram - histograms
> README.ndistinct - ndistinct coefficients
>
> The READMEs are added and modified through the patch series, so the best thing to do is apply all the patches and start reading.
>
> I have not improved the user-oriented SGML documentation in this patch; that's one of the tasks I'd like to work on next. But the READMEs should give you a good idea how it's supposed to work, and there are some examples of use in the regression tests.
>
>
> Significantly simplified places
> -------------------------------
>
> This patch version also significantly simplifies several places that were needlessly complex in the previous ones. Firstly, the function evaluating clauses on multivariate histograms was rather bloated, so I've simplified it a lot. Similarly for the code in clauselist_select() that combines multiple statistics to estimate a list of clauses - that's much simpler now too. And various other pieces.
>
> That being said, I still think the code in clausesel.c can be simplified. I feel there's a lot of cruft, mostly due to unknowingly implementing something that could be solved by an existing function.
>
> A prime example of that is inspecting the expression tree to check if we know how to estimate the clauses using the multivariate statistics. That sounds like a nice match for an expression walker, but it's currently done by custom code. I plan to look at that next.
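>
> Just to illustrate the idea (this is only a sketch with made-up names,
> not code from the patch), the walker could look roughly like this,
> collecting the attnums of Vars and aborting on anything we don't know
> how to estimate:
>
> static bool
> mv_compatible_walker(Node *node, void *context)
> {
>     Bitmapset **attnums = (Bitmapset **) context;
>
>     if (node == NULL)
>         return false;
>
>     if (IsA(node, Var))
>     {
>         Var *var = (Var *) node;
>
>         /* multivariate stats only cover plain user columns */
>         if (var->varattno <= 0)
>             return true;        /* abort the walk */
>
>         *attnums = bms_add_member(*attnums, var->varattno);
>         return false;
>     }
>
>     if (IsA(node, OpExpr))
>     {
>         /* here we'd check the operator is one we can estimate (=, <, ...) */
>     }
>
>     /* let expression_tree_walker recurse into the node's children */
>     return expression_tree_walker(node, mv_compatible_walker, context);
> }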
>
> Also, I'm not quite sure I understand what the varRelid parameter of clauselist_selectivity is for, so the code may be handling that wrong (seems to be working though).
>
>
> ndistinct coefficients
> ----------------------
>
> The one new piece in this patch is the GROUP BY estimation, based on the ndistinct coefficients. So for example you can do this:
>
> CREATE TABLE t AS SELECT mod(i,1000) AS a, mod(i,1000) AS b
> FROM generate_series(1,1000000) s(i);
> ANALYZE t;
> EXPLAIN SELECT * FROM t GROUP BY a, b;
>
> which currently does this:
>
>                                QUERY PLAN
> -----------------------------------------------------------------------
>  Group  (cost=127757.34..135257.34 rows=99996 width=8)
>    Group Key: a, b
>    ->  Sort  (cost=127757.34..130257.34 rows=1000000 width=8)
>          Sort Key: a, b
>          ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=8)
> (5 rows)
>
> but we know that there are only 1000 groups because the columns are correlated. So let's create ndistinct statistics on the two columns:
>
> CREATE STATISTICS s1 ON t (a,b) WITH (ndistinct);
> ANALYZE t;
>
> which results in estimates like this:
>
>                            QUERY PLAN
> -----------------------------------------------------------------
>  HashAggregate  (cost=19425.00..19435.00 rows=1000 width=8)
>    Group Key: a, b
>    ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=8)
> (3 rows)
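>
> For comparison, the actual number of groups is exactly 1000 (a and b
> are always equal here), which is easy to verify with a query like:
>
> SELECT count(*) FROM (SELECT 1 FROM t GROUP BY a, b) g;
>
> so the new estimate is spot on.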
>
> I'm not quite sure how to combine this type of statistics with MCV lists and histograms, so for now it's used only for GROUP BY.

Well, firstly, the patches all apply.

But I have a question (which is coming really late, but I'll ask it
anyway). Is it intended that CREATE STATISTICS will only be for
multivariate statistics? Or do you think we could add support for
expression statistics in future too?

e.g.

CREATE STATISTICS stats_comment_length ON comments (length(comment));
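
(The closest thing I'm aware of today is creating an expression index,
e.g. CREATE INDEX ON comments (length(comment)), since ANALYZE then
gathers statistics for the indexed expression - a statistics object
without the index maintenance overhead would be a nice alternative.)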

I also note that the docs contain this:

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name ON table_name ( [
{ column_name } ] [, ...])
[ WITH ( statistics_parameter [= value] [, ... ] )

The opening square bracket before WITH doesn't get closed. It also
indicates that the columns are entirely optional, so () would be valid,
but that's not the case. And a space is missing after the first
ellipsis. So I think this should read:

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name ON table_name (
{ column_name } [, ... ])
[ WITH ( statistics_parameter [= value] [, ... ] ) ]
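
FWIW the example from earlier in the thread already matches that
corrected form:

CREATE STATISTICS s1 ON t (a,b) WITH (ndistinct);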

Regards

Thom
