From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|

To: | Tatsuo Ishii <ishii(at)postgresql(dot)org> |

Cc: | robertmhaas(at)gmail(dot)com, david(at)pgmasters(dot)net, tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)2ndquadrant(dot)com, petr(at)2ndquadrant(dot)com, jeff(dot)janes(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org |

Subject: | Re: multivariate statistics (v19) |

Date: | 2016-08-03 01:58:13 |

Message-ID: | 0924a7a9-a219-1366-d3bc-2ddd98bc9269@2ndquadrant.com |

Views: | Raw Message | Whole Thread | Download mbox |

Thread: | |

Lists: | pgsql-hackers |

Hi,

Attached is v19 of the "multivariate stats" patch series - essentially

v18 rebased on top of current master. Aside from a few bug fixes, the

main improvement is addition of SGML docs demonstrating the statistics

in a way similar to the current "Row Estimation Examples" (and the docs

are actually in the same section). I've tried to keep the right amount

of technical detail (and pointing to the right README for additional

details), but this may need improvements. I have not written docs

explaining how statistics may be combined yet (more about this later).

There are two general design questions that I'd like to get feedback on:

1) enriching the query tree with multivariate statistics info

Right now all the stuff related to multivariate statistics estimation

happens in clausesel.c - matching condition to statistics, selection of

statistics to use (if there are multiple usable stats), etc. So pretty

much all this info is internal to clausesel.c and does not get outside.

I'm starting to think that some of the steps (matching quals to stats,

selection of stats) should happen in a "preprocess" step before the

actual estimation, storing the information (which stats to use, etc.) in

a new type of node in the query tree - something like RestrictInfo.

I believe this needs to happen sometime after deconstruct_jointree() as

that builds RestrictInfos nodes, and looking at planmain.c, right after

extract_restriction_or_clauses seems about right. Haven't tried, though.

This would move all the "statistics selection" logic from clausesel.c,

separating it from the "actual estimation" and simplifying the code.

But more importantly, I think we'll need to show some of the data in

EXPLAIN output. With per-column statistics it's fairly straightforward

to determine which statistics are used and how. But with multivariate

stats things are often more complicated - there may be multiple

candidate statistics (e.g. histograms covering different subsets of the

conditions), it's possible to apply them in different orders, etc.

But EXPLAIN can't show the info if it's ephemeral and available only

within clausesel.c (and thrown away after the estimation).

2) combining multiple statistics

I think the ability to combine multivariate statistics (covering

different subsets of conditions) is important and useful, but I'm

starting to think that the current implementation may not be the correct

one (which is why I haven't written the SGML docs about this part of the

patch series yet).

Assume there's a table "t" with 3 columns (a, b, c), and that we're

estimating query:

SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3

but that we only have two statistics (a,b) and (b,c). The current patch

does about this:

P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)

i.e. it estimates the first two conditions using (a,b), and then

estimates (c=3) using (b,c) with "b=2" as a condition. Now, this is very

efficient, but it only works as long as the query contains conditions

"connecting" the two statistics. So if we remove the "b=2" condition

from the query, this stops working.

But it's possible to do this differently, e.g. by doing this:

P(a=1) * P(c=3|a=1)

where P(c=3|a=1) is using (b,c), but uses (a,b) to restrict the set of

buckets (if the statistics is a histogram) to consider. In pseudo-code,

it might look like this:

buckets = {}

foreach bucket x in (b,c):

foreach bucket y in (a,b):

if y matches (a=1) and overlap(x,y):

buckets := buckets + x

which is the part of (b,c) matching (a=1), allowing us to compute the

conditional probability.

It may get more complicated, of course. In particular, there may be

different types of statistics, and we need to be able to "match" them

against each other. With just MCV lists and histograms that's probably

easy enough, but if we add other types of statistics, it may get way

more complicated.

I still think this is a useful capability, but perhaps there are better

ideas how to do that. In any case, it only affects the last part of the

patch (0006).

regards

--

Tomas Vondra http://www.2ndQuadrant.com

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment | Content-Type | Size |
---|---|---|

multivariate-stats-v19.tgz | application/x-compressed-tar | 150.8 KB |

- Re: multivariate statistics v14 at 2016-04-09 17:37:57 from Tatsuo Ishii

- Re: multivariate statistics (v19) at 2016-08-05 04:24:40 from Michael Paquier
- Re: multivariate statistics (v19) at 2016-08-10 04:41:30 from Michael Paquier
- Re: multivariate statistics (v19) at 2016-08-23 17:03:16 from Robert Haas

From | Date | Subject | |
---|---|---|---|

Next Message | Alfred Perlstein | 2016-08-03 02:30:22 | Re: Why we lost Uber as a user |

Previous Message | Michael Paquier | 2016-08-03 01:40:23 | Re: Increasing timeout of poll_query_until for TAP tests |