Re: multivariate statistics (v19)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: robertmhaas(at)gmail(dot)com, david(at)pgmasters(dot)net, tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)2ndquadrant(dot)com, petr(at)2ndquadrant(dot)com, jeff(dot)janes(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multivariate statistics (v19)
Date: 2016-08-03 01:58:13
Message-ID: 0924a7a9-a219-1366-d3bc-2ddd98bc9269@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hi,

Attached is v19 of the "multivariate stats" patch series - essentially
v18 rebased on top of current master. Aside from a few bug fixes, the
main improvement is addition of SGML docs demonstrating the statistics
in a way similar to the current "Row Estimation Examples" (and the docs
are actually in the same section). I've tried to keep the right amount
of technical detail (and pointing to the right README for additional
details), but this may need improvements. I have not written docs
explaining how statistics may be combined yet (more about this later).

There are two general design questions that I'd like to get feedback on:

1) enriching the query tree with multivariate statistics info

Right now all the stuff related to multivariate statistics estimation
happens in clausesel.c - matching condition to statistics, selection of
statistics to use (if there are multiple usable stats), etc. So pretty
much all this info is internal to clausesel.c and does not get outside.

I'm starting to think that some of the steps (matching quals to stats,
selection of stats) should happen in a "preprocess" step before the
actual estimation, storing the information (which stats to use, etc.) in
a new type of node in the query tree - something like RestrictInfo.

I believe this needs to happen sometime after deconstruct_jointree() as
that builds RestrictInfos nodes, and looking at planmain.c, right after
extract_restriction_or_clauses seems about right. Haven't tried, though.

This would move all the "statistics selection" logic from clausesel.c,
separating it from the "actual estimation" and simplifying the code.

But more importantly, I think we'll need to show some of the data in
EXPLAIN output. With per-column statistics it's fairly straightforward
to determine which statistics are used and how. But with multivariate
stats things are often more complicated - there may be multiple
candidate statistics (e.g. histograms covering different subsets of the
conditions), it's possible to apply them in different orders, etc.

But EXPLAIN can't show the info if it's ephemeral and available only
within clausesel.c (and thrown away after the estimation).

2) combining multiple statistics

I think the ability to combine multivariate statistics (covering
different subsets of conditions) is important and useful, but I'm
starting to think that the current implementation may not be the correct
one (which is why I haven't written the SGML docs about this part of the
patch series yet).

Assume there's a table "t" with 3 columns (a, b, c), and that we're
estimating query:

SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3

but that we only have two statistics (a,b) and (b,c). The current patch
does about this:

P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)

i.e. it estimates the first two conditions using (a,b), and then
estimates (c=3) using (b,c) with "b=2" as a condition. Now, this is very
efficient, but it only works as long as the query contains conditions
"connecting" the two statistics. So if we remove the "b=2" condition
from the query, this stops working.

But it's possible to do this differently, e.g. by doing this:

P(a=1) * P(c=3|a=1)

where P(c=3|a=1) is using (b,c), but uses (a,b) to restrict the set of
buckets (if the statistics is a histogram) to consider. In pseudo-code,
it might look like this:

buckets = {}
foreach bucket x in (b,c):
foreach bucket y in (a,b):
if y matches (a=1) and overlap(x,y):
buckets := buckets + x

which is the part of (b,c) matching (a=1), allowing us to compute the
conditional probability.

It may get more complicated, of course. In particular, there may be
different types of statistics, and we need to be able to "match" them
against each other. With just MCV lists and histograms that's probably
easy enough, but if we add other types of statistics, it may get way
more complicated.

I still think this is a useful capability, but perhaps there are better
ideas how to do that. In any case, it only affects the last part of the
patch (0006).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
multivariate-stats-v19.tgz application/x-compressed-tar 150.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alfred Perlstein 2016-08-03 02:30:22 Re: Why we lost Uber as a user
Previous Message Michael Paquier 2016-08-03 01:40:23 Re: Increasing timeout of poll_query_until for TAP tests