Re: WIP: multivariate statistics / proof of concept

From: Katharina Büchse <katharina(dot)buechse(at)uni-jena(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP: multivariate statistics / proof of concept
Date: 2014-11-13 15:51:57
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 13.11.2014 14:11, Tomas Vondra wrote:
> Dne 13 Listopad 2014, 12:31, Simon Riggs napsal(a):
>> On 12 October 2014 23:00, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>>> It however seems to be working sufficiently well at this point, enough
>>> to get some useful feedback. So here we go.
>> This looks interesting and useful.
>> What I'd like to check before a detailed review is that this has
>> sufficient applicability to be useful.
>> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
>> result of multi-column stats errors.
>> Could you look at those queries and confirm that this patch can
>> produce better plans for them?
> Sure. I planned to do such verification/demonstration anyway, after
> discussing the overall approach.
> I planned to give it a try on TPC-DS, but I can start with the TPC-H
> queries you propose. I'm not sure whether the poor estimates in Q9 & Q18
> come from column correlation though - if it's due to some other issues
> (e.g. conditions that are difficult to estimate), this patch can't do
> anything with them. But it's a good start.
>> If so, I will work with you to review this patch.
> Thanks!
>> One aspect of the patch that seems to be missing is a user declaration
>> of correlation, just as we have for setting n_distinct. It seems like
>> an even easier place to start to just let the user specify the stats
>> declaratively. That way we can split the patch into two parts. First,
>> allow multi column stats that are user declared. Then add user stats
>> collected by ANALYZE. The first part is possibly contentious and thus
>> a good initial focus. The second part will have lots of discussion, so
>> good to skip for a first version.
> I'm not a big fan of this approach, for a number of reasons.
> Firstly, it only works for "simple" parameters that are trivial to specify
> (say, Pearson's correlation coefficient), and the patch does not work with
> those at all - it only works with histograms, MCV lists (and might work
> with associative rules in the future). And we certainly can't ask users to
> specify multivariate histograms - because it's very difficult to do, and
> also because complex stats are more susceptible to get stale after adding
> new data to the table.
> Secondly, even if we add such "simple" parameters to the patch, we have to
> come up with a way to apply those parameters to the estimates. The
> problem is that as the parameters get simpler, it's less and less useful
> to compute the stats.
> Another question is whether it should support more than 2 columns ...
> The only place where I think this might work are the associative rules.
> It's simple to specify rules like ("ZIP code" implies "city") and we could
> even do some simple check against the data to see if it actually makes
> sense (and 'disable' the rule if not).
and even this simple example has its limits, at least in Germany ZIP
codes are not unique for rural areas, where several villages have the
same ZIP code.

I guess there are just a few examples where columns are completely
functional dependent without any exceptions.
But of course, if the user gives this information just for optimization
the statistics, some exceptions don't matter.
If this information should be used for creating different execution
plans (e.g. on column A is an index and column B is functional
dependent, one could think about using this index on A and the
dependency instead of running through the whole table to find all tuples
that fit the query on column B), exceptions are a very important issue.
> But maybe I got it wrong and you have something particular in mind? Can
> you give an example of how it would work?
> regards
> Tomas

Dipl.-Math. Katharina Büchse
Friedrich-Schiller-Universität Jena
Institut für Informatik
Lehrstuhl für Datenbanken und Informationssysteme
Ernst-Abbe-Platz 2
07743 Jena
Telefon 03641/946367

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2014-11-13 15:52:01 controlling psql's use of the pager a bit more
Previous Message Tom Lane 2014-11-13 15:32:50 Re: On the warpath again about ill-considered inclusion nests