Re: two dimensional statistics in Postgres

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: two dimensional statistics in Postgres
Date: 2014-11-07 19:37:51
Message-ID: 545D1F8F.7030307@fuzzy.cz
Lists: pgsql-hackers

On 7.11.2014 13:19, Katharina Büchse wrote:
> On 06.11.2014 11:56, Tomas Vondra wrote:
>> On 6 November 2014, 11:15, Katharina Büchse wrote:
>>>
>>> because correlations might occur only in parts of the data. In this
>>> case a histogram based on a sample of the whole table might miss the
>>> correlation and wouldn't help for the part of the data the user seems
>>> to be interested in.
>>
>> Yeah, there may be dependencies that are difficult to spot in the whole
>> dataset, but emerge once you filter to a specific subset.
>>
>> Now, how would that work in practice? Initially the query needs to be
>> planned using regular stats (because there's no feedback yet), and then -
>> when we decide the estimates are way off - it may be re-planned using the
>> feedback. The feedback is inherently query-specific, so I'm not sure if
>> it's possible to reuse it for multiple queries (might be possible for
>> "sufficiently similar" ones).
>>
>> Would this be done automatically for all queries / all conditions, or
>> only when specifically enabled (on a table, columns, ...)?
>
> The idea is the following: I want to detect correlations with two
> different algorithms, one scanning samples of the data, the other
> analyzing query feedback.

So you're starting with the default (either single or multivariate)
statistics, computed by ANALYZE from a sample covering the whole table,
and then computing matching statistics while running the query (i.e.
from a sample filtered by the WHERE clauses)?

> If both decide for a column combination that it's correlated, then
> a "regular histogram" should be built for this combination. If
> only the "scanning" algorithm says "correlated", it means that
> there is some correlation, but it's not interesting for the user
> right now.

Isn't it sufficient to simply compare the estimated and observed number
of rows? Either they're sufficiently close, in which case the existing
stats are good enough (either the columns are independent, or there are
appropriate multivariate stats) and additional stats are not necessary.
Or the estimates are way off, in which case either there are no
multivariate stats, or the ones that exist are not suitable for this
query.

Or do you plan to compute some other stats, possibly more complex,
allowing for more complicated correlation detection?
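
To be concrete, the "comparison" I have in mind is trivial - something
like this sketch (made-up names, not actual planner/executor code):

    #include <math.h>
    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Are the stats "good enough" for this clause? Compare the
     * estimated and observed row counts, allowing a relative error
     * factor (say, 4x in either direction).
     */
    static bool
    estimate_is_close(double estimated, double observed, double max_ratio)
    {
        /* clamp to 1 to avoid division by zero on empty results */
        estimated = fmax(estimated, 1.0);
        observed = fmax(observed, 1.0);

        double ratio = (estimated > observed) ? estimated / observed
                                              : observed / estimated;
        return ratio <= max_ratio;
    }

    int
    main(void)
    {
        /* estimated 1000 rows, saw 80000 -> stats are way off */
        printf("%d\n", estimate_is_close(1000, 80000, 4.0));
        return 0;
    }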

> So I would only "leave some note" for the optimizer that there is
> correlation, and if the user's interest changes and query results
> differ a lot from the estimates in the plan, then again -- "regular
> histogram". If only the "feedback" algorithm decides that the columns
> are correlated, a histogram based on query feedback is the most
> useful choice to support the work of the optimizer.
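
If I'm reading this right, the combined decision matrix is roughly the
following (my paraphrase of the three cases, as a sketch):

    #include <stdbool.h>

    typedef enum
    {
        DO_NOTHING,       /* neither algorithm sees correlation */
        BUILD_REGULAR,    /* both agree -> build a regular histogram */
        LEAVE_NOTE,       /* scan only -> just note the correlation */
        BUILD_FEEDBACK    /* feedback only -> feedback-based histogram */
    } StatsAction;

    StatsAction
    decide(bool scan_correlated, bool feedback_correlated)
    {
        if (scan_correlated && feedback_correlated)
            return BUILD_REGULAR;
        if (scan_correlated)
            return LEAVE_NOTE;
        if (feedback_correlated)
            return BUILD_FEEDBACK;
        return DO_NOTHING;
    }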

So the check is performed only when the query completes, and if there's
a mismatch you record a note somewhere that those columns are correlated?

I think what exactly is stored in the "note" is crucial. If it's only a
list of columns and an "iscorrelated" flag, then I'm not sure how the
optimizer is going to use that directly, i.e. without an actual
histogram or at least corrected estimates.
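
In other words, if the note is merely something like this (a
hypothetical layout, plain C types), there are no numbers the planner
could plug into its selectivity math:

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Hypothetical "note" left by the detection algorithm. There is no
     * histogram and no corrected estimate here -- only a hint that
     * building real multivariate stats might pay off.
     */
    typedef struct CorrelationNote
    {
        uint32_t relid;         /* table the note applies to */
        int16_t  attnums[2];    /* the pair of correlated columns */
        bool     iscorrelated;  /* verdict of the detector */
    } CorrelationNote;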

Actually, I'm starting to wonder whether I understand the idea. I'm
aware of the following two kinds of "feedback histograms":

a) building the full histogram from query feedback (STGrid/STHoles)

The main goal is to build and refine a histogram, without
examining the data set directly (not even sampling it).

b) optimizing the set of histograms wrt. to workload (SASH)

Decides what histograms to build, with respect to the observed
workload (i.e. executed queries).

Is the presented approach similar to one of these, or is it something
different?
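
For context, my understanding of the core of option (a) is that bucket
frequencies simply get nudged toward what the executor actually
observed - heavily simplified, not the full algorithm from either
paper:

    /*
     * Feedback refinement in the STGrid/STHoles spirit: after a query,
     * move a bucket's frequency toward the observed row count, without
     * ever touching the base data.
     */
    typedef struct Bucket
    {
        double lo[2];   /* lower corner of the 2-D bounding box */
        double hi[2];   /* upper corner */
        double freq;    /* estimated number of tuples inside */
    } Bucket;

    void
    refine_bucket(Bucket *bucket, double observed, double damping)
    {
        /* damping in (0, 1]: 1.0 means trust the feedback completely */
        bucket->freq += damping * (observed - bucket->freq);
    }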

Actually, the "Introduction" in the CORDS paper says this:

CORDS is a data-driven tool that automatically discovers
correlations and soft functional dependencies (fds) between pairs
of columns and, based on these relationships, recommends a set of
statistics for the query optimizer to maintain.

which sounds like option (b). I haven't read the rest of the paper,
though.
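
From that description, I'd guess the detection itself boils down to an
independence test (e.g. chi-squared) on a contingency table built from
a sample, for each column pair - my guess at the mechanics, not code
from the paper:

    /*
     * Chi-squared statistic for a two-column contingency table built
     * from a sample; large values suggest the columns are correlated.
     * counts is a row-major nrows x ncols matrix, total its grand sum.
     */
    double
    chi_squared(const double *counts, int nrows, int ncols, double total)
    {
        double stat = 0.0;

        for (int i = 0; i < nrows; i++)
        {
            for (int j = 0; j < ncols; j++)
            {
                /* marginal sums for row i and column j */
                double rowsum = 0.0, colsum = 0.0;

                for (int k = 0; k < ncols; k++)
                    rowsum += counts[i * ncols + k];
                for (int k = 0; k < nrows; k++)
                    colsum += counts[k * ncols + j];

                /* expected count if the columns were independent */
                double expected = rowsum * colsum / total;
                double diff = counts[i * ncols + j] - expected;

                if (expected > 0.0)
                    stat += diff * diff / expected;
            }
        }

        /* compare against the chi^2 critical value for
         * (nrows - 1) * (ncols - 1) degrees of freedom */
        return stat;
    }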

>>> There are special data structures for storing multidimensional
>>> histograms based on feedback, and I have already tried to implement
>>> one of these in C. In the case of two dimensions they are of course
>>> not "for free" (one-dimensional would be much cheaper), but based
>>> on the principle of maximum entropy they deliver really good
>>> results. I decided on only two dimensions because this gives the
>>> best cost/benefit ratio when searching for correlation (here I'm
>>> relying on
>>
>> I think hardcoding the two-dimensions limit is wrong. I understand a
>> higher number of dimensions means a more expensive operation, but if
>> the user can influence it, I believe it's OK.
>>
> I don't know whether the user has to decide whether the statistical
> data is based on feedback or on data scans. I guess it's enough if
> he gets his higher-dimensional histograms based on data scans.

I wasn't talking about deciding whether to use regular or feedback stats
(although I believe features like this should be opt-in), but about
tweaking the number of dimensions.

For example, imagine there are three columns [a,b,c] and you know the
data is somehow correlated. Is it better to build three 2-dimensional
feedback histograms (a,b), (a,c) and (b,c), a single 3-dimensional
histogram (a,b,c), or maybe something else?
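
The storage cost alone makes a big difference. With a fixed resolution
of b buckets per dimension:

    #include <stdio.h>

    int
    main(void)
    {
        int b = 100;    /* buckets per dimension, arbitrary */

        /* three 2-D histograms vs. one 3-D histogram over [a,b,c] */
        printf("3 x 2-D: %d buckets\n", 3 * b * b);  /* 30000 */
        printf("1 x 3-D: %d buckets\n", b * b * b);  /* 1000000 */
        return 0;
    }

The 2-D histograms are far cheaper, but pairwise marginals can never
capture an effect that only shows up in all three columns jointly.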

>> Also, is there any particular reason not to support other kinds
>> of stats (say, MCV lists)? In the end it's just a different way to
>> approximate the distribution, and it may be way cheaper than
>> histograms.
>>
> The reason actually is just that 1) I have only limited time and
> cannot cover every possible way of supporting the optimizer when
> there is correlation, and 2) there are good papers about
> feedback-based histograms :-D

OK, understood. Those are completely valid reasons. I was just curious
whether there's some reason that makes extending this to other kinds of
stats impossible.
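
To illustrate why I'd expect MCV lists to be cheap - a multivariate MCV
list is just the most common value combinations with their frequencies,
and the lookup is a trivial scan (hypothetical layout, nothing like the
actual pg_statistic format):

    /*
     * Hypothetical multivariate MCV list: the K most common (a,b)
     * combinations and their frequencies. No bucket geometry at all.
     */
    typedef struct McvItem
    {
        int    values[2];   /* one (a,b) value combination */
        double freq;        /* its frequency in the table */
    } McvItem;

    double
    mcv_selectivity(const McvItem *items, int nitems, int a, int b)
    {
        for (int i = 0; i < nitems; i++)
        {
            if (items[i].values[0] == a && items[i].values[1] == b)
                return items[i].freq;
        }
        return 0.0;     /* not in the list -> fall back to other stats */
    }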

>>> tests that were made in DB2 within a project called CORDS, which
>>> detects correlations even between different tables).
>>
>> Is this somehow related to LEO? I'm not familiar with the details,
>> but from the description it might be related.
>
> Actually, LEO is purely feedback-based, while CORDS uses data
> scans. But some authors were involved in both projects, I guess. LEO
> itself was never fully integrated into DB2, but some parts of it
> were. What's interesting for me is the fact that DB2 has no
> support for multidimensional histograms.

OK, so the point is to optimize the set of histograms, similar to what
CORDS does, but based on feedback. Correct?

Tomas
