Re: multivariate statistics (v19)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, david(at)pgmasters(dot)net, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, jeff(dot)janes(at)gmail(dot)com, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2016-08-15 20:50:17
Message-ID: 2e275433-53c6-cabe-3344-b2ed625c4a95@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 08/10/2016 06:41 AM, Michael Paquier wrote:
> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> 1) enriching the query tree with multivariate statistics info
>>
>> Right now all the stuff related to multivariate statistics estimation
>> happens in clausesel.c - matching condition to statistics, selection of
>> statistics to use (if there are multiple usable stats), etc. So pretty much
>> all this info is internal to clausesel.c and does not get outside.
>
> This does not seem bad to me as first sight but...
>
>> I'm starting to think that some of the steps (matching quals to stats,
>> selection of stats) should happen in a "preprocess" step before the actual
>> estimation, storing the information (which stats to use, etc.) in a new type
>> of node in the query tree - something like RestrictInfo.
>>
>> I believe this needs to happen sometime after deconstruct_jointree() as that
>> builds RestrictInfos nodes, and looking at planmain.c, right after
>> extract_restriction_or_clauses seems about right. Haven't tried, though.
>>
>> This would move all the "statistics selection" logic from clausesel.c,
>> separating it from the "actual estimation" and simplifying the code.
>>
>> But more importantly, I think we'll need to show some of the data in EXPLAIN
>> output. With per-column statistics it's fairly straightforward to determine
>> which statistics are used and how. But with multivariate stats things are
>> often more complicated - there may be multiple candidate statistics (e.g.
>> histograms covering different subsets of the conditions), it's possible to
>> apply them in different orders, etc.
>>
>> But EXPLAIN can't show the info if it's ephemeral and available only within
>> clausesel.c (and thrown away after the estimation).
>
> This gives a good reason to not do that in clauserel.c, it would be
> really cool to be able to get some information regarding the stats
> used with a simple EXPLAIN.

I've been thinking about this, and I'm afraid it's way more complicated
in practice. It essentially means doing something like

rel->baserestrictinfo = enrichWithStatistics(rel->baserestrictinfo);

for each table (and in the future maybe also for joins etc.) But as the
name suggests the list should only include RestrictInfo nodes, which
seems to contradict the transformation.

For example with conditions

WHERE (a=1) AND (b=2) AND (c=3)

the list will contain 3 RestrictInfos. But if there's a statistics on
(a,b,c), we need to note that somehow - my plan was to inject a node
storing this information, something like (a bit simplified):

StatisticsInfo {
Oid statisticsoid; /* OID of the statistics */
List *mvconditions; /* estimate using the statistics */
List *otherconditions; /* estimate the old way */
}

But that'd clearly violate the assumption that baserestrictinfo only
contains RestrictInfo. I don't think it's feasible (or desirable) to
rework all the places to expect both RestrictInfo and the new node.

I can think of two alternatives:

1) keep the transformed list as separate list, next to baserestrictinfo

This obviously fixes the issue, as each caller can decide which node it
wants. But it also means we need to maintain two lists instead of one,
and keep them synchronized.

2) embed the information into the existing tree

It might be possible to store the information in existing nodes, i.e.
each node would track whether it's estimated the "old way" or using
multivariate statistics (and which one). But it would require changing
many of the existing nodes (at least those compatible with multivariate
statistics: currently OpExpr, NullTest, ...).

And it also seems fairly difficult to reconstruct the information during
the estimation, as it'd be necessary to look for other nodes to be
estimated by the same statistics. Which seems to defeat the idea of
preprocessing to some degree.

So I'm not sure what's the best solution. I'm leaning to (1), i.e.
keeping a separate list, but I'd welcome other ideas.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jim Nasby 2016-08-15 20:58:45 Re: PATCH: Exclude additional directories in pg_basebackup
Previous Message Peter Eisentraut 2016-08-15 20:36:06 Re: Let's get rid of the separate minor version numbers for shlibs