Re: multivariate statistics / patch v7

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: tomas(dot)vondra(at)2ndquadrant(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, jeff(dot)janes(at)gmail(dot)com, sfrost(at)snowman(dot)net
Subject: Re: multivariate statistics / patch v7
Date: 2015-07-27 07:04:38
Message-ID: 20150727.160438.264168398.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Hello,

At Sat, 25 Jul 2015 23:09:31 +0200, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote in <55B3FB0B(dot)7000201(at)2ndquadrant(dot)com>
> Hi,
>
> On 07/16/2015 01:51 PM, Kyotaro HORIGUCHI wrote:
> > Hi, I'd like to show you the modified constitution of
> > multivariate statistics application logic. Please find the
> > attached. They apply on your v7 patch.
>
> Sadly I do have some trouble getting it to apply correctly :-(
> So for now all my comments are based on just reading the code.

Ah. My modification to rebase it onto the master at the time is
probably the culprit. Sorry for the dirty patch.

# I would have recreated the patch if you had complained before
# struggling with it..

The core of the modification is in clausesel.c. I have attached
the patched clausesel.c.

> FWIW I've rebased my patch to the current master, it's available on
> github as usual:
>
> https://github.com/tvondra/postgres/commits/mvstats

Thanks.

> > The code to find mv-applicable clause is moved out of the main
> > flow of clauselist_selectivity. As I said in the previous mail,
> > the new function transformRestrictInfoForEstimate (too bad name
> > but just for PoC:) scans clauselist and generates
> > RestrictStatsData struct which drives mv-aware selectivity
> > calculation. This struct isolates MV and non-MV estimation.
> >
> > The struct RestrictStatData mainly consists of the following
> > three parts,
> >
> > - clause to be estimated by current logic (MV is not applicable)
> > - clause to be estimated by MV-staistics.
> > - list of child RestrictStatDatas, which are to be run
> > recursively.
> >
> > mvclause_selectivty() is the topmost function where mv stats
> > works. This structure effectively prevents main estimation flow
> > from being broken by modifying mvstats part. Although I haven't
> > measured but I'm positive the code is far reduced from yours.
> >
> > I attached two patches to this message. The first one is to
> > rebase v7 patch to current(maybe) master and the second applies
> > the refactoring.
> >
> > I'm a little anxious about performance but I think this makes the
> > process to apply mv-stats far clearer. Regtests for mvstats
> > succeeded asis except for fdep, which is not implememted in this
> > patch.
> >
> > What do you think about this?
>
> I'm not sure, at this point. I'm having a hard time understanding how
> exactly the code works - there are pretty much no comments explaining
> the implementation, so it takes time to understand the code. This is
> especially true about transformRestrictInfoForEstimate which is also
> quite long. I understand it's a PoC, but comments would really help.

The patch itself is probably hardly readable, because it is based
not on master but on your last patch plus some changes.

My concerns about the code at the time were the following:

- You embedded the logic of multivariate estimation into
clauselist_selectivity. I think estimation using multivariate
statistics is quite different from the ordinary estimation based
on single-column stats, so they are logically separable and we
should separate them.

- You are taking a top-down approach, which runs a tree walk to
check the applicability of mv-stats at every step down the
clause tree. If a subtree is found to be mv-applicable, it is
split into two parts - mv-compatible and non-compatible. These
steps require expression tree walking, which looks like it uses
too much CPU.

- You seem to be considering the case where users create many
multivariate statistics on attribute sets having overlaps. But
that looks like too much to me. MV-stats consume more resources,
so we can assume minimal usage of them.

My suggestion in the patch is a bottom-up approach to finding the
mv-applicable portion(s) of the expression tree, which is the
basic way the planner works overall. This approach requires no
repetitive runs of the tree walker, that is, pull_varnos. It could
fail to find the 'optimal' solution in complex situations, but it
needs far less computation for almost the same result (I think..).

Even though it doesn't consider functional dependencies, the
reduction in code shows the efficiency. It does nothing tricky.

> On a conceptual level, I think the idea to split the estimation into
> two phases - enrich the expression tree with nodes with details about
> stats etc, and then actually do the estimation in the second phase
> might be interesting. Not because it's somehow clearer, but because it
> gives us a chance to see the expression tree as a whole, with details
> about all the stats (with the current code we process/estimate the
> tree incrementally). But I don't really know how useful that would be.

It is difficult to say which approach is better, since that
depends on what we consider most important. However, I am
concerned that your code substantially reconstructs the expression
(clause) tree in the middle of processing it. I believe that
should be a separate phase, for simplicity. Of course the
additional resources required should also be considered, but they
are rather reduced in this case.

> I don't think the proposed change makes the process somehow clearer. I
> know it's a PoC at this point, so I don't expect it to be perfect, but
> for me the original code is certainly clearer. Of course, I'm biased
> as I wrote the current code, and I (naturally) shaped it to match my
> ideas during the development process, and I'm much more familiar with
> it.

Mmm, we need someone else's opinion:) What I think on this point
is described just above... OK, I'll try to describe it in other
words.

The embedded approach increases the states and code paths roughly
on a multiplicative basis, while the separate approach adds them
on an additive basis. I think this is the most significant reason
why I feel it 'clearer'.

Of course, the acceptable complexity differs according to the
fundamental complexity, performance, required memory and other
factors, but I feel this is too much complexity for the objective.

> Omitting the support for functional dependencies is a bit unfortunate,
> I think. Is that merely to make the PoC simpler, or is there something
> that makes it impossible to support that kind of stats?

I don't think so. I omitted it simply because it would take more
time to implement.

> Another thing that I noticed is that you completely removed the code
> that combined multiple stats (and selected the best combination of
> stats). In other words, you've reverted to the intermediate single
> statistics approach, including removing the improved handling of OR
> clauses and conditions.

Yeah, good catch :p I noticed just after submitting the patch
that I retain only one statistic at the second level from the
bottom, but that is easily fixed by changing the pruning timing.
The struct can hold multiple statistics anyway.

And I didn't omit the OR case. It is handled along with the AND
case. (in a wrong way?)

> It's a bit difficult to judge the proposed
> approach not knowing how well it supports those (quite crucial)
> features. What if it can't support some of them, or what if it makes the
> code much more complicated (thus defeating the goal of making it more
> clear)?

OR is supported, and fdeps are probably supportable, but all of
this happens within the function with the entangled name
(transform..something). But I should think more about your latest
code before that.

> I share your concern about the performance impact - one thing is that
> this new code might be slower than the original one, but a more
> serious issue IMHO is that the performance impact will happen even for
> relations with no multivariate stats at all. The original patch was
> very careful about getting ~0% overhead in such cases,

I don't think so. find_stats runs pull_varnos, and
transformRestric.. also uses pull_varnos to bail out at the top
level. They should have almost the same overhead in that case.

> and if the new
> code does not allow that, I don't see this approach as acceptable. We
> must not put additional overhead on people not using multivariate
> stats.
>
> But I think it's worth exploring this idea a bit more - can you rebase
> it to the current patch version (as on github) and adding the missing
> pieces (functional dependencies, multi-statistics estimation and
> passing conditions)?

With pleasure. Please wait a while.

> One more thing - I noticed you extended the pg_operator catalog with a
> oprmvstat attribute, used to flag operators that are compatible with
> multivariate stats. I'm not happy with the current approach (using
> oprrest to do this decision), but I'm not really sure this is a good
> solution either. The culprit is that it only answers one of the two
> important questions - Is it compatible? How to perform the estimation?

Honestly speaking, I don't like this either. But checking oprrest
is just about as unpleasant.

> So we'd have to rely on oprrest anyway, when actually performing the
> estimation of a clause with "compatible" operator. And we'd have to
> keep in sync two places (catalog and checks in file), and we'd have to
> update the catalog after improving the implementation (adding support
> for another operator).

Mmm. It depends on what the developers think the definition of
oprrest is. More practically, I'm worried about whether oprrest
can be anything other than eqsel for an equality operator, and
the same for comparison operators.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
