Re: WIP: multivariate statistics / proof of concept

From: "Tomas Vondra" <tv(at)fuzzy(dot)cz>
To: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: multivariate statistics / proof of concept
Date: 2014-11-13 13:11:51
Message-ID: 92f5beef80ea25a899c8b00cc3babd6d.squirrel@2.emaily.eu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dne 13 Listopad 2014, 12:31, Simon Riggs napsal(a):
> On 12 October 2014 23:00, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>
>> It however seems to be working sufficiently well at this point, enough
>> to get some useful feedback. So here we go.
>
> This looks interesting and useful.
>
> What I'd like to check before a detailed review is that this has
> sufficient applicability to be useful.
>
> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
> result of multi-column stats errors.
>
> Could you look at those queries and confirm that this patch can
> produce better plans for them?

Sure. I planned to do such verification/demonstration anyway, after
discussing the overall approach.

I planned to give it a try on TPC-DS, but I can start with the TPC-H
queries you propose. I'm not sure whether the poor estimates in Q9 & Q18
come from column correlation though - if it's due to some other issues
(e.g. conditions that are difficult to estimate), this patch can't do
anything with them. But it's a good start.

> If so, I will work with you to review this patch.

Thanks!

> One aspect of the patch that seems to be missing is a user declaration
> of correlation, just as we have for setting n_distinct. It seems like
> an even easier place to start to just let the user specify the stats
> declaratively. That way we can split the patch into two parts. First,
> allow multi column stats that are user declared. Then add user stats
> collected by ANALYZE. The first part is possibly contentious and thus
> a good initial focus. The second part will have lots of discussion, so
> good to skip for a first version.

I'm not a big fan of this approach, for a number of reasons.

Firstly, it only works for "simple" parameters that are trivial to specify
(say, Pearson's correlation coefficient), and the patch does not work with
those at all - it only works with histograms, MCV lists (and might work
with associative rules in the future). And we certainly can't ask users to
specify multivariate histograms - because it's very difficult to do, and
also because complex stats are more susceptible to get stale after adding
new data to the table.

Secondly, even if we add such "simple" parameters to the patch, we have to
come up with a way to apply those parameters to the estimates. The
problem is that as the parameters get simpler, it's less and less useful
to compute the stats.

Another question is whether it should support more than 2 columns ...

The only place where I think this might work are the associative rules.
It's simple to specify rules like ("ZIP code" implies "city") and we could
even do some simple check against the data to see if it actually makes
sense (and 'disable' the rule if not).

But maybe I got it wrong and you have something particular in mind? Can
you give an example of how it would work?

regards
Tomas

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2014-11-13 13:18:22 Re: tracking commit timestamps
Previous Message Petr Jelinek 2014-11-13 12:53:43 Re: tracking commit timestamps