Re: multivariate statistics / patch v7

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: tomas(dot)vondra(at)2ndquadrant(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, jeff(dot)janes(at)gmail(dot)com, sfrost(at)snowman(dot)net
Subject: Re: multivariate statistics / patch v7
Date: 2015-07-30 11:26:35
Message-ID: 20150730.202635.171579201.horiguchi.kyotaro@lab.ntt.co.jp

Hello, I certainly attached the file this time.

At Mon, 27 Jul 2015 23:54:08 +0200, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote in <55B6A880(dot)3050801(at)2ndquadrant(dot)com>
> > The core of the modification is in clausesel.c. I attached the
> > patched clausesel.c.
>
> I don't see any attachment. Perhaps you forgot to actually attach it?

Very sorry to have forgotten to attach it. I have attached a new
patch that applies on the head of the mvstats branch of your
repository.

> > My concerns about the code at the time were the following:
> >
> > - You embedded the logic of multivariate estimation into
> > clauselist_selectivity. I think estimation using multivariate
> > statistics is quite different from the ordinary estimation based
> > on single-column stats, so they are logically separable and we
> > should separate them.
>
> I don't see them as very different, actually quite the opposite. The
> two kinds of statistics are complementary and should naturally
> coexist. Perhaps the current code is not perfect and a refactoring
> would make the code more readable, but I don't think its primary aim
> should be to separate regular and multivariate stats.
>
> > - You are taking a top-down approach, which runs a tree walk to
> > check the applicability of mv-stats at every step down the
> > clause tree. If a subtree is found to be mv-applicable, it is
> > split into two parts - mv-compatible and non-compatible. These
> > steps require expression tree walking, which looks like it uses
> > too much CPU.
>
> I'm taking top-down approach because that's what the regular stats do,
> and also because that's what allows implementing the features that I
> think are interesting - ability to combine multiple stats in an
> efficient way, pass conditions and such. I think those two features
> are very useful and allow very interesting things.
>
> The bottom-up would work too, probably - I mean, we could start from
> leaves of the expression tree, and build the largest "subtree"
> compatible with multivariate stats and then try to estimate it. I
> don't see how we could pass conditions though, which works naturally
> in the top-down approach.

By the way, 'conditions' seems to mean what is received by the
parameter of clause(list)_selectivity with the same name, but that
parameter is always NIL. Judging from the comment for
collect_mv_attnum, it is prepared for 'multitable statistics'. If
so, I think it is better removed from the current patch, because it
is unused for now.
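
To be concrete, every call site I could find looks like the sketch
below (only a sketch from my reading of the patch; the exact
signature may differ):

    /* Sketch: the 'conditions' argument is always NIL at the call sites. */
    s1 = clauselist_selectivity(root, clauses, varRelid,
                                jointype, sjinfo,
                                NIL);   /* conditions */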

> Or maybe a combination of both - identify the "compatible" subtrees
> first, then perform the top-down phase.
>
> > - You seem to be considering cases where users create many
> > multivariate statistics on overlapping attribute sets. But that
> > looks like too much to me. MV-stats consume more resources, so
> > we can assume they will be used sparingly.
>
> Not really. I don't expect huge numbers of multivariate stats to be
> built on the tables.
>
> But I think restricting the users to use a single multivariate
> statistics per table would be a significant limitation. And once you
> allow using multiple multivariate statistics for the set of clauses,
> supporting overlapping stats is not that difficult.
>
> What it however makes possible is combining multiple "small" stats
> into a larger one in a very efficient way - it assumes the overlap is
> sufficient, of course. But if that's true you may build multiple small
> (and very accurate) stats instead of one huge (or very inaccurate)
> statistics.
>
> This also makes it possible to handle complex combinations of clauses
> that are compatible and incompatible with multivariate statistics, by
> passing the conditions.
>
> >
> > My suggestion in the patch is a bottom-up approach to find the
> > mv-applicable portion(s) of the expression tree, which is the
> > basic way the planner works overall. The approach requires no
> > repeated runs of the tree walker, that is, pull_varnos. It may
> > fail to find the 'optimal' solution in complex situations, but
> > it needs far less computation for almost the same result (I
> > think..).
> >
> > Even though it doesn't handle functional dependencies, the
> > reduction in code shows the efficiency. It does nothing tricky.
>
> OK

The functional dependency code looks immature in both the detection
phase and the application phase, in comparison to MCV and
histograms. In addition, as the comment in dependencies.c says,
fdeps are less significant (than MCV/HIST) because they are usually
carefully avoided, and should be noticed and accounted for when
designing the application or the whole system.

Persisting in applying all of them at once does not seem to be a
good strategy to adopt this early.

Or perhaps it would be better to register the dependency itself,
rather than registering incomplete information (only the set of
columns involved in the relationship) and trying to detect the
relationship from the given values. I suppose those who can register
the column set know the precise nature of the dependency in advance.
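
A loose sketch of what I mean (the struct below is hypothetical, not
taken from the patch):

    /*
     * Hypothetical sketch: register the direction of the dependency
     * explicitly, instead of storing only the involved column set
     * and re-detecting the relationship from sampled values.
     */
    typedef struct MVDependencyDef
    {
        Bitmapset  *determinant;    /* e.g. {zip_code} */
        AttrNumber  dependent;      /* e.g. city */
    } MVDependencyDef;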

> >> On a conceptual level, I think the idea to split the estimation into
> >> two phases - enrich the expression tree with nodes with details about
> >> stats etc, and then actually do the estimation in the second phase
> >> might be interesting. Not because it's somehow clearer, but because it
> >> gives us a chance to see the expression tree as a whole, with details
> >> about all the stats (with the current code we process/estimate the
> >> tree incrementally). But I don't really know how useful that would be.
> >
> > It is difficult to say which approach is better, since that is
> > affected by what we consider more important than other things.
> > However, I am concerned that your code substantially
> > reconstructs the expression (clause) tree in the midst of
> > processing it. I believe that should be a separate phase, for
> > simplicity. Of course, the additional resources required should
> > also be considered, but they are rather reduced in this case.
>
> What do you mean by "reconstruct the expression tree"? It's true I'm
> walking the expression tree top-down, but how is that reconstructing?

For example, clauselist_mv_split does. It separates the mvclauses
from the original clause list, applies mv-stats to them at once, and
(perhaps) lets the rest be processed by the 'normal' route. That is
what I called "reconstruction", which I tried to do explicitly and
as a separate phase.

> >> I don't think the proposed change makes the process somehow clearer. I
> >> know it's a PoC at this point, so I don't expect it to be perfect, but
> >> for me the original code is certainly clearer. Of course, I'm biased
> >> as I wrote the current code, and I (naturally) shaped it to match my
> >> ideas during the development process, and I'm much more familiar with
> >> it.
> >
> > Mmm, we need someone else's opinion :) What I think on this
> > point is described just above... OK, I will try to describe it
> > in other words.
>
> I find your comments very valuable. I may not agree with some of them,
> but I certainly appreciate your point of view. So thank you very much
> for the time you spent reviewing this patch so far!

Yeah, thank you for your patience and kindness.

> > The embedded approach increases the states and code paths
> > roughly multiplicatively. The separate approach adds them
> > additively. I think this is the most significant reason why I
> > feel it 'clearer'.
> >
> > Of course, the acceptable complexity differs according to the
> > fundamental complexity, performance, required memory or other
> > factors, but I feel this is too much complexity for the
> > objective.
>
> Yes, I think we might have slightly different objectives in mind.

Sure! Now I understand what the point is.

> Regarding the complexity - I am not too worried about spending more
> CPU cycles on this, as long as it does not impact the case where
> people have no multivariate statistics at all. That's because I expect
> people to use this for large DSS/DWH data sets with lots of
> dependencies in the (often denormalized) tables and complex conditions
> - in those cases the planning difference is negligible, especially if
> the improved estimates make the query run in seconds instead of hours.

I share that vision. If that is the case, the mv-stats route should
not intrude on the existing non-mv-stats route. All the more, I feel
you have intruded too much into clauselist_selectivity.

In that case, my mv-distinct code has a different objective from
yours. It aims to prevent the misestimations caused by multicolumn
correlations, which occur more commonly in OLTP usage.

> This is why I was so careful to entirely skip the expensive processing
> when there were no multivariate stats, and why I don't like the fact
> that your approach makes this skip more difficult (or maybe
> impossible, I'm not sure).

My code skips the whole thing if transformRestrictionForEstimate
returns NULL, and runs clauselist_selectivity as usual. I think it
is almost the same as yours.
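
In outline it looks like this (only a sketch; the names other than
transformRestrictionForEstimate are invented for illustration):

    /*
     * Sketch: the mv route bails out up front.  When no multivariate
     * statistics apply, transformRestrictionForEstimate() returns
     * NULL and the unmodified per-column path runs as before.
     */
    Selectivity
    clauselist_selectivity(PlannerInfo *root, List *clauses,
                           int varRelid, JoinType jointype,
                           SpecialJoinInfo *sjinfo)
    {
        Node       *transformed;

        transformed = transformRestrictionForEstimate(root, clauses);

        if (transformed == NULL)        /* no mv-stats apply */
            return clauselist_selectivity_std(root, clauses, varRelid,
                                              jointype, sjinfo);

        return clauselist_selectivity_mv(root, transformed);
    }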

However, if you think that way, I believe we should not only skip
the calculation but also hide the additional code blocks, which
currently overwhelm the normal route. That is one of the major
objectives of my approach.

> It's also true that most OLTP queries (especially the short ones, thus
> most impacted by the increase of planning time) use rather
> short/simple clause lists, so even the top-down approach should be
> very cheap.
>
> >> Omitting the support for functional dependencies is a bit unfortunate,
> >> I think. Is that merely to make the PoC simpler, or is there something
> >> that makes it impossible to support that kind of stats?
> >
> > I don't think so. I omitted it simply because it would take
> > more time to implement.
>
> OK, thanks for confirming this.
>
> >
> >> Another thing that I noticed is that you completely removed the code
> >> that combined multiple stats (and selected the best combination of
> >> stats). In other words, you've reverted to the intermediate single
> >> statistics approach, including removing the improved handling of OR
> >> clauses and conditions.
> >
> > Yeah, good catch :p Just after submitting the patch, I noticed
> > that I retain only one statistics at the second level from the
> > bottom, but it is easily fixed by changing the pruning timing.
> > The struct can hold multiple statistics anyway.
>
> Great!

But sorry, I found that considering multiple stats at every level
cannot be done without an exhaustive search of the combinations
among child clauses, and it needs an additional data structure. It
needs more thought... As mentioned later, top-down might be more
suitable for this optimization.

> > And I don't omit the OR case. It is handled along with the AND
> > case. (In a wrong way?)
>
> Oh, I see. I got a bit confused because you've removed the
> optimization step (and conditions), and that needs to be handled a bit
> differently for the OR clauses.

Sorry to have made you read an inapplicable patch :p

> >> It's a bit difficult to judge the proposed
> >> approach not knowing how well it supports those (quite crucial)
> >> features. What if it can't support some of them, or what if it makes the
> >> code much more complicated (thus defeating the goal of making it more
> >> clear)?
> >
> > OR is supported, and fdep is probably supportable, but all of
> > this happens within the function with the entangled name
> > (transform..something). But I should study your latest code
> > more before that.
>
> Good. Likewise, I'd like to see more of your approach ;-)
>
> >
> >> I share your concern about the performance impact - one thing is that
> >> this new code might be slower than the original one, but a more
> >> serious issue IMHO is that the performance impact will happen even for
> >> relations with no multivariate stats at all. The original patch was
> >> very careful about getting ~0% overhead in such cases,
> >
> > I don't think so. find_stats runs pull_varnos, and
> > transformRestric.. also uses pull_varnos to bail out at the top
> > level. They should have almost the same overhead in that case.
>
> Understood. As I explained above, I'm not all that concerned about the
> performance impact, as long as we make sure it only applies to people
> using the multivariate stats.
>
> I also think a combined approach - first a bottom-up step (identifying
> the largest compatible subtrees & caching the varnos), then a top-down
> step (doing the same optimization as implemented today) might minimize
> the performance impact.

I am reaching almost the same conclusion.
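
Something like this two-pass shape (only a sketch of the idea; all
names are invented):

    /*
     * Hypothetical two-pass shape: a cheap bottom-up pass finds the
     * largest mv-compatible subtrees and caches their varnos, then
     * the existing top-down optimization runs only within those
     * subtrees.
     */
    static Selectivity
    clauselist_selectivity_two_pass(PlannerInfo *root, List *clauses)
    {
        List       *mv_subtrees;
        ListCell   *lc;
        Selectivity s = 1.0;

        mv_subtrees = collect_mv_subtrees(root, clauses);   /* bottom-up */
        foreach(lc, mv_subtrees)
            s *= estimate_mv_subtree(root, (Node *) lfirst(lc));  /* top-down */

        return s;
    }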

> >> and if the new
> >> code does not allow that, I don't see this approach as acceptable. We
> >> must not put additional overhead on people not using multivariate
> >> stats.
> >>
> >> But I think it's worth exploring this idea a bit more - can you rebase
> >> it to the current patch version (as on github) and add the missing
> >> pieces (functional dependencies, multi-statistics estimation and
> >> passing conditions)?
> >
> > With pleasure. Please wait for a while.
>
> Sure. Take your time.
>
> >
> >> One more thing - I noticed you extended the pg_operator catalog with a
> >> oprmvstat attribute, used to flag operators that are compatible with
> >> multivariate stats. I'm not happy with the current approach (using
> oprrest to make this decision), but I'm not really sure this is a good
> >> solution either. The culprit is that it only answers one of the two
> >> important questions - Is it compatible? How to perform the estimation?
> >
> > Honestly speaking, I don't like this either. But checking
> > oprrest is just as unpleasant.
>
> The patch is already quite massive, so let's use the same approach as
> current stats, and leave this problem for another patch. If we come up
> with a great idea, we can work on it, but I see this as a loosely
> related annoyance rather than something this patch aims to address.

Agreed.

> >> So we'd have to rely on oprrest anyway, when actually performing the
> >> estimation of a clause with "compatible" operator. And we'd have to
> >> keep in sync two places (catalog and checks in file), and we'd have to
> >> update the catalog after improving the implementation (adding support
> >> for another operator).
> >
> > Mmm. It depends on what the developers think about the
> > definition of oprrest. More practically, I'm worried about
> > whether it must always be eqsel for any equality operator. And
> > the same for comparison operators.
>
> OTOH if you define a new operator with oprrest=F_EQSEL, you're
> effectively saying "It's OK to estimate this using regular eq/lt/gt
> operators". If your operator is somehow incompatible with that, you
> should not set oprrest=F_EQSEL.

Conversely, some function other than F_EQSEL might be compatible
with mv-statistics.

Still, that is not my main concern. Although I think they really are
effectively the same, I am uneasy about using a field that is
apparently not intended (or suited) to distinguish this kind of
operator property.
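
For reference, the oprrest-based test amounts to something like the
sketch below (my reading; the actual patch may differ). A dedicated
flag such as the proposed oprmvstat would replace the F_*
comparisons:

    /*
     * Sketch: classifying an operator as mv-compatible by its
     * restriction estimator.  oprrest selects an estimator function;
     * here it is reused to infer a property of the operator, which
     * is what I am uneasy about.
     */
    #include "utils/fmgroids.h"
    #include "utils/lsyscache.h"

    static bool
    mv_compatible_operator(Oid opno)
    {
        switch (get_oprrest(opno))
        {
            case F_EQSEL:           /* equality */
            case F_SCALARLTSEL:     /* '<' comparisons */
            case F_SCALARGTSEL:     /* '>' comparisons */
                return true;
            default:
                return false;
        }
    }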

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Modify-the-estimate-path-to-be-bottom-up-processing.patch text/x-patch 332.9 KB
