Re: multivariate statistics / patch v7

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org, jeff(dot)janes(at)gmail(dot)com, sfrost(at)snowman(dot)net
Subject: Re: multivariate statistics / patch v7
Date: 2015-07-14 14:21:36
Message-ID: 55A51AF0.7020501@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 07/13/2015 10:51 AM, Kyotaro HORIGUCHI wrote:
>
> Ok, I understood the diferrence between what I thought and what you
> say. The code is actually concious of OR clause but is looks somewhat
> confused.

I'm not sure which part is confused by the OR clauses, but it's
certainly possible. Initially it only handled AND clauses, and the
support for OR clauses was added later, so it's possible some parts are
not behaving correctly.

>
> Currently choosing mv stats in clauselist_selectivity can be
> outlined as following,
>
> 1. find_stats finds candidate mv stats containing *all*
> attributes appeared in the whole clauses regardless of and/or
> exprs by walking whole the clause tree.
>
> Perhaps this is the measure to early bailout.

Not entirely. The goal of find_stats() is to lookup all stats on the
'current' relation - it's coded the way it is because I had to deal with
varRelid=0 cases, in which case I have to inspect the Var nodes. But
maybe I got this wrong and there's much simpler way to do that?

It is an early bailout in the sense that if there are no multivariate
stats defined on the table, there's no point in doing any of the
following steps. So that we don't increase planning times for users not
using multivariate stats.

> 2.1. Within every disjunction elements, collect mv-related
> attributes while checking whether the all leaf nodes (binop or
> ifnull) are compatible by (eventually) walking whole the
> clause tree.

Generally, yes. The idea is to check whether there are clauses that
might be estimated using multivariate statistics, and whether the
clauses reference at least two different attributes. Imagine a query
like this:

SELECT * FROM t WHERE (a=1) AND (a>0) AND (a<100)

It makes no sense to process this using multivariate statistics, because
all the Var nodes reference a single attribute.

Similarly, the check is not just about the leaf nodes - to be able to
estimate a clause at this point, we have to be able to process the whole
tree, starting from the top-level clause. Although maybe that's no
longer true, now that support for OR clauses was added ... I wonder
whether there are other BoolExpr-like nodes, that might make the tree
incompatible with multivariate statistics (in the sense that the current
implementation does not know how to handle them).

Also note that even though the clause may be "incompatible" at this
level, it may get partially processed by multivariate statistics later.
For example with a query:

SELECT * FROM t WHERE (a=1 OR b=2 OR c ~* 'xyz') AND (q=1 OR r=4)

the first query is "incompatible" because it contains unsupported
operator '~*', but it will eventually be processed as BoolExpr node, and
should be split into two parts - (a=1 OR b=2) which is compatible, and
(c ~* 'xyz') which is incompatible.

This split should happen in clauselist_selectivity_or(), and the other
thing that may be interesting is that it uses (q=1 OR r=4) as a
condition. So if there's a statistics built on (a,b,q,r) we'll compute
conditional probability

P(a=1,b=2 | q=1,r=4)

>
> 2.2. Check if all the collected attribute are contained in
> mv-stats columns.

No, I think you got this wrong. We do not check that *all* the
attributes are contained in mvstats columns - we only need two such
columns (then there's a chance that the multivariate statistics will get
applied).

Anyway, both 2.1 and 2.2 are meant as a quick bailout, before doing the
most expensive part, which is choose_mv_statistics(). Which is however
missing in this list.

> 3. Finally, clauseset_mv_selectivity_histogram() (and others).
>
> This funciton applies every ExprOp onto every attribute in
> every histogram backes and (tries to) make the boolean
> operation of the result bitmaps.

Yes, but this only happens after choose_mv_statistics(), because that's
the code that decides which statistics will be used and in what order.

The list is also missing handling of the 'functional dependencies', so a
complete list of steps would look like this:

1) find_stats - lookup stats on the current relation (from RelOptInfo)

2) apply functional dependencies

a) check if there are equality clauses that may be reduced using
functional dependencies, referencing at least two columns

b) if yes, perform the clause reduction

3) apply MCV lists and histograms

a) check if there are clauses 'compatible' with those types of
statistics, again containing at least two columns

b) if yes, use choose_mv_statistics() to decide which statistics to
apply and in which order

c) apply the selected histograms and MCV lists

4) estimate the remaining clauses using the regular statistics

a) this is where the clauselist_mv_selectivity_histogram and other
are called

I tried to explain this in the comment before clauselist_selectivity(),
but maybe it's not detailed enough / missing some important details.

>
> I have some comments on the implement and I also try to find the
> solution for them.
>
>
> 1. The flow above looks doing very similiar thins repeatedly.

I worked hard to remove such code duplicities, and believe all the
current steps are necessary - for example 2(a) and 3(a) may seems
similar, but it's really necessary to do that twice.

>
> 2. I believe what the current code does can be simplified.

Possibly.

>
> 3. As you mentioned in comments, some additional infrastructure
> needed.
>
> After all, I think what we should do after this are as follows,
> as the first step.

OK.

>
> - Add the means to judge the selectivity operator(?) by other
> than oprrest of the op of ExprOp. (You missed neqsel already)

Yes, the way we use 'oprno' to determine how to estimate the selectivity
is a bit awkward. It's inspired by handling of range queries, and having
something better would be nice.

But I don't think this is the reason why I missed neqsel, and I don't
see this as a significant design issue at this point. But if we can come
up with a better solution, why not ...

> I suppose one solution for this is adding oprmvstats taking
> 'm', 'h' and 'f' and their combinations. Or for the
> convenience, it would be a fixed-length string like this.
>
> oprname | oprmvstats
> = | 'mhf'
> <> | 'mhf'
> < | 'mh-'
> > | 'mh-'
> >= | 'mh-'
> <= | 'mh-'
>
> This would make the code in clause_is_mv_compatible like this.
>
> > oprmvstats = get_mvstatsset(expr->opno); /* bitwise representation */
> > if (oprmvstats & types)
> > {
> > *attnums = bms_add_member(*attnums, var->varattno);
> > return true;
> > }
> > return false;

So this only determines the compatibility of operators with respect to
different types of statistics? How does that solve the neqsel case? It
will probably decide the clause is compatible, but it will later fail at
the actual estimation, no?

>
> - Current design just manage to work but it is too complicated
> and hardly have affinity with the existing estimation
> framework.

I respectfully disagree. I've strived to make it as affine to the
current implementation as possible - maybe it's possible to improve
that, but I believe there's a natural difference between the two types
of statistics. It may be somewhat simplified, but it will never be
exactly the same.

> I proposed separation of finding stats phase and
> calculation phase, but I would like to propose transforming
> RestrictInfo(and finding mvstat) phase and running the
> transformed RestrictInfo phase after looking close to the
> patch.

Those phases are already separated, as is illustrated by the steps
explained above.

So technically we might place a hook either right after the find_stats()
call, so that it's possible to process all the stats on the table, or
maybe after the choose_mv_statistics() call, so that we only process the
actually used stats.

>
> I think transforing RestrictInfo makes the situnation
> better. Since it nedds different information, maybe it is
> better to have new struct, say, RestrictInfoForEstimate
> (boo!). Then provide mvstatssel() to use in the new struct.
> The rough looking of the code would be like below.
>
> clauselist_selectivity()
> {
> ...
> RestrictInfoForEstmate *esclause =
> transformClauseListForEstimation(root, clauses, varRelid);
> ...
>
> return clause_selectivity(esclause):
> }
>
> clause_selectivity(RestrictInfoForEstmate *esclause)
> {
> if (IsA(clause, RestrictInfo))...
> if (IsA(clause, RestrictInfoForEstimate))
> {
> RestrictInfoForEstimate *ecl = (RestrictInfoForEstimate*) clause;
> if (ecl->selfunc)
> {
> sx = ecl->selfunc(root, ecl);
> }
> }
> if (IsA(clause, Var))...
> }
>
>
> transformClauseListForEstimation(...)
> {
> ...
>
> relid = collect_mvstats_info(root, clause, &attlist);
> if (!relid) return;
> if (get_mvstats_hook)
> mvstats = (*get_mvstats_hoook) (root, relid, attset);
> else
> mvstats = find_mv_stats(root, relid, attset))
> }
> ...

So you'd transform the clause tree first, replacing parts of the tree
(to be estimated by multivariate stats) by a new node type? That's an
interesting idea, I think ...

I can't really say whether it's a good approach, though. Can you explain
why do you think it'd make the situation better?

The one benefit I can think of is being able to look at the processed
tree and see which parts will be estimated using multivariate stats.

But we'd effectively have to do the same stuff (choosing the stats,
...), and if we move this pre-processing before clauselist_selectivity
(I assume that's the point), we'd end up repeating a lot of the code. Or
maybe not, I'm not sure.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2015-07-14 14:25:10 Re: RFC: replace pg_stat_activity.waiting with something more descriptive
Previous Message Heikki Linnakangas 2015-07-14 13:25:46 Re: pgsql: Retain comments on indexes and constraints at ALTER TABLE ... TY