Re: multivariate statistics / patch v7

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org, jeff(dot)janes(at)gmail(dot)com, sfrost(at)snowman(dot)net
Subject: Re: multivariate statistics / patch v7
Date: 2015-07-27 21:54:08
Message-ID: 55B6A880.3050801@2ndquadrant.com
Lists: pgsql-hackers

Hello Horiguchi-san,

On 07/27/2015 09:04 AM, Kyotaro HORIGUCHI wrote:
> Hello,
>
> At Sat, 25 Jul 2015 23:09:31 +0200, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote in <55B3FB0B(dot)7000201(at)2ndquadrant(dot)com>
>> Hi,
>>
>> On 07/16/2015 01:51 PM, Kyotaro HORIGUCHI wrote:
>>> Hi, I'd like to show you the modified structure of the
>>> multivariate statistics application logic. Please find the
>>> attached. They apply on top of your v7 patch.
>>
>> Sadly I do have some trouble getting it to apply correctly :-(
>> So for now all my comments are based on just reading the code.
>
> Ah. My modification to rebase onto the master for the time being
> must be the culprit. Sorry for the dirty patch.
>
> # I would have recreated the patch if you had complained before
> # struggling with the thing..
>
> The core of the modification is in clausesel.c. I attached the
> patched clausesel.c.

I don't see any attachment. Perhaps you forgot to actually attach it?

>
> My concerns about the code at the time were the following:
>
> - You embedded the logic of multivariate estimation into
> clauselist_selectivity. I think estimation using multivariate
> statistics is quite different from the ordinary estimation based
> on single-column stats, so they are logically separable and
> we should separate them.

I don't see them as very different; quite the opposite, actually. The two
kinds of statistics are complementary and should naturally coexist.
Perhaps the current code is not perfect and a refactoring would make it
more readable, but I don't think its primary aim should be to separate
regular and multivariate stats.

>
> - You are taking a top-down approach, which runs a tree walk to
> check the applicability of mv-stats at every step down the
> clause tree. If a subtree is found to be mv-applicable, it is
> split into two parts - mv-compatible and non-compatible. These
> steps require expression tree walking, which looks like it uses
> too much CPU.

I'm taking the top-down approach because that's what the regular stats
do, and also because that's what allows implementing the features I
consider important - the ability to combine multiple stats in an
efficient way, to pass conditions and so on. I think those two features
are very useful and enable very interesting things.

A bottom-up approach would probably work too - I mean, we could start
from the leaves of the expression tree, build the largest "subtree"
compatible with multivariate stats, and then try to estimate it. I don't
see how we could pass conditions that way, though, which works naturally
in the top-down approach.

Or maybe a combination of both - identify the "compatible" subtrees
first, then perform the top-down phase.
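
To make this a bit more concrete, the top-down split boils down to
roughly the following (a simplified sketch only, not the actual patch
code - collect_mv_clauses and clauselist_mv_selectivity are made-up
names standing in for the real pieces):

    Selectivity
    clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
                           JoinType jointype, SpecialJoinInfo *sjinfo)
    {
        Selectivity s1 = 1.0;
        List       *mvclauses = NIL;    /* clauses covered by mv stats */
        ListCell   *l;

        /* split the clause list (stands in for the tree-walking step) */
        clauses = collect_mv_clauses(root, clauses, &mvclauses);

        /* estimate the mv-compatible part using multivariate stats */
        if (mvclauses != NIL)
            s1 *= clauselist_mv_selectivity(root, mvclauses);

        /* and the remaining clauses using the regular per-column stats */
        foreach(l, clauses)
            s1 *= clause_selectivity(root, (Node *) lfirst(l),
                                     varRelid, jointype, sjinfo);

        return s1;
    }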

> - You seem to be considering cases where users create many
> multivariate statistics on overlapping attribute sets. But that
> looks like too much to me. MV-stats are more resource-hungry, so
> we can assume they will be used sparingly.

Not really. I don't expect huge numbers of multivariate stats to be
built on the tables.

But I think restricting users to a single multivariate statistics per
table would be a significant limitation. And once you allow using
multiple multivariate statistics for a set of clauses, supporting
overlapping stats is not that difficult.

What it does make possible, however, is combining multiple "small" stats
into a larger one in a very efficient way - assuming the overlap is
sufficient, of course. But if that's true, you may build multiple small
(and very accurate) stats instead of one huge (or very inaccurate)
statistics.

This also makes it possible to handle complex combinations of clauses
that are compatible and incompatible with multivariate statistics, by
passing the conditions.
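
To give a trivial example of what I mean by passing conditions (just an
illustration, not output from the patch): with statistics on (a,b) and
on (b,c), the clause list

    WHERE a = 1 AND b = 1 AND c = 1

can be estimated roughly as

    P(a=1, b=1) * P(c=1 | b=1)

where the first factor comes from the (a,b) statistics and the second
from the (b,c) statistics, with "b=1" passed down as the condition. That
only works because the two statistics overlap on column "b".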

>
> My suggestion in the patch is a bottom-up approach to find the
> mv-applicable portion(s) of the expression tree, which is the way
> the planner basically works overall. The approach requires no
> repetitive runs of a tree walker, that is, pull_varnos. It could
> fail to find the 'optimal' solution in complex situations but needs
> far less calculation for almost the same return (I think..).
>
> Even though it doesn't consider functional dependencies, the
> reduction in code shows the efficiency. It does nothing
> tricky.

OK

>> On a conceptual level, I think the idea to split the estimation into
>> two phases - enrich the expression tree with nodes with details about
>> stats etc, and then actually do the estimation in the second phase
>> might be interesting. Not because it's somehow clearer, but because it
>> gives us a chance to see the expression tree as a whole, with details
>> about all the stats (with the current code we process/estimate the
>> tree incrementally). But I don't really know how useful that would be.
>
> It is difficult to say which approach is better, since it depends
> on what we consider more important than other things. However, I
> am concerned that your code substantially reconstructs the
> expression (clause) tree in the midst of processing it. I believe
> that should be a separate phase, for simplicity. Of course the
> additional required resources should also be considered, but they
> are rather reduced in this case.

What do you mean by "reconstruct the expression tree"? It's true I'm
walking the expression tree top-down, but how is that reconstructing?

>
>> I don't think the proposed change makes the process somehow clearer. I
>> know it's a PoC at this point, so I don't expect it to be perfect, but
>> for me the original code is certainly clearer. Of course, I'm biased
>> as I wrote the current code, and I (naturally) shaped it to match my
>> ideas during the development process, and I'm much more familiar with
>> it.
>
> Mmm. We need someone else's opinion :) What I think on this point
> is described just above... OK, I'll try to describe it in other
> words.

I find your comments very valuable. I may not agree with some of them,
but I certainly appreciate your point of view. So thank you very much
for the time you spent reviewing this patch so far!

> The embedded approach increases the state and code paths roughly
> multiplicatively. The separate approach adds them roughly
> additively. I think this is the most significant reason why I feel
> it is 'clearer'.
>
> Of course, the acceptable complexity differs according to the
> fundamental complexity, performance, required memory or something
> else, but I feel this is too much complexity for the objective.

Yes, I think we might have slightly different objectives in mind.

Regarding the complexity - I am not too worried about spending more CPU
cycles on this, as long as it does not impact the case where people have
no multivariate statistics at all. That's because I expect people to use
this for large DSS/DWH data sets with lots of dependencies in the (often
denormalized) tables and complex conditions - in those cases the
planning difference is negligible, especially if the improved estimates
make the query run in seconds instead of hours.

This is why I was so careful to skip the expensive processing entirely
when there are no multivariate stats, and why I don't like the fact
that your approach makes this skip more difficult (or maybe impossible,
I'm not sure).
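
In other words, the intent is essentially this (names are made up, just
shorthand for the check at the very top of the estimation):

    /*
     * No multivariate stats on this relation - fall through to the
     * original per-clause estimation, paying (almost) nothing extra.
     */
    if (!has_mv_stats(root, varRelid))
        return clauselist_selectivity_orig(root, clauses, varRelid,
                                           jointype, sjinfo);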

It's also true that most OLTP queries (especially the short ones, thus
most impacted by the increase of planning time) use rather short/simple
clause lists, so even the top-down approach should be very cheap.

>
>> Omitting the support for functional dependencies is a bit unfortunate,
>> I think. Is that merely to make the PoC simpler, or is there something
>> that makes it impossible to support that kind of stats?
>
> I don't think so. I omitted it simply because it would take more
> time to implement.

OK, thanks for confirming this.

>
>> Another thing that I noticed is that you completely removed the code
>> that combined multiple stats (and selected the best combination of
>> stats). In other words, you've reverted to the intermediate single
>> statistics approach, including removing the improved handling of OR
>> clauses and conditions.
>
> Yeah, good catch :p I noticed just after submitting the patch
> that I retain only one statistics at the second level from the
> bottom, but it is easily fixed by changing the pruning timing. The
> struct can hold multiple statistics anyway.

Great!

>
> And I don't omit the OR case. It is handled along with the AND
> case. (in a wrong way?)

Oh, I see. I got a bit confused because you've removed the optimization
step (and conditions), and that needs to be handled a bit differently
for the OR clauses.

>
>> It's a bit difficult to judge the proposed
>> approach not knowing how well it supports those (quite crucial)
>> features. What if it can't support some of them, or what if it makes the
>> code much more complicated (thus defeating the goal of making it more
>> clear)?
>
> OR is supported, Fdep is maybe supportable, but all of this
> occurs within the function with the tangled name
> (transform..something). But I should give more consideration to
> your latest code before that.

Good. Likewise, I'd like to see more of your approach ;-)

>
>> I share your concern about the performance impact - one thing is that
>> this new code might be slower than the original one, but a more
>> serious issue IMHO is that the performance impact will happen even for
>> relations with no multivariate stats at all. The original patch was
>> very careful about getting ~0% overhead in such cases,
>
> I don't think so. find_stats runs pull_varnos, and
> transformRestric.. also uses pull_varnos to bail out at the top
> level. They should have almost the same overhead in that case.

Understood. As I explained above, I'm not all that concerned about the
performance impact, as long as we make sure it only applies to people
using the multivariate stats.

I also think a combined approach - first a bottom-up step (identifying
the largest compatible subtrees & caching the varnos), then a top-down
step (doing the same optimization as implemented today) - might minimize
the performance impact.
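
Something like a single bottom-up pass producing small annotations might
be enough for that (a rough sketch, the names are not from the patch):

    /*
     * Per-subtree annotation built in one bottom-up pass, so the later
     * top-down optimization step can reuse the cached varnos instead of
     * calling pull_varnos repeatedly.
     */
    typedef struct MvClauseInfo
    {
        Node   *clause;         /* the original clause (sub)tree */
        Relids  varnos;         /* pull_varnos() result, cached once */
        bool    mv_compatible;  /* can this subtree use mv stats? */
    } MvClauseInfo;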

>
>> and if the new
>> code does not allow that, I don't see this approach as acceptable. We
>> must not put additional overhead on people not using multivariate
>> stats.
>>
>> But I think it's worth exploring this idea a bit more - can you rebase
>> it to the current patch version (as on github) and adding the missing
>> pieces (functional dependencies, multi-statistics estimation and
>> passing conditions)?
>
> With pleasure. Please wait for a while.

Sure. Take your time.

>
>> One more thing - I noticed you extended the pg_operator catalog with an
>> oprmvstat attribute, used to flag operators that are compatible with
>> multivariate stats. I'm not happy with the current approach (using
>> oprrest to do this decision), but I'm not really sure this is a good
>> solution either. The culprit is that it only answers one of the two
>> important questions - Is it compatible? How to perform the estimation?
>
> Honestly speaking, I also don't like this. But checking oprrest is
> just about as unpleasant.

The patch is already quite massive, so let's use the same approach as
current stats, and leave this problem for another patch. If we come up
with a great idea, we can work on it, but I see this as a loosely
related annoyance rather than something this patch aims to address.

>> So we'd have to rely on oprrest anyway, when actually performing the
>> estimation of a clause with "compatible" operator. And we'd have to
>> keep in sync two places (catalog and checks in file), and we'd have to
>> update the catalog after improving the implementation (adding support
>> for another operator).
>
> Mmm. It depends on what the developers think about the definition
> of oprrest. More practically, I'm worried about whether it can be
> anything other than eqsel for any equality operator. And the same
> for comparison operators.

OTOH if you define a new operator with oprrest=F_EQSEL, you're
effectively saying "It's OK to estimate this using regular eq/lt/gt
operators". If your operator is somehow incompatible with that, you
should not set oprrest=F_EQSEL.
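
For example, the oprrest-based check I have in mind is roughly this (a
sketch only; mv_compatible_operator is not an actual function name):

    #include "utils/fmgroids.h"
    #include "utils/lsyscache.h"

    /*
     * Treat an operator as "equality-like" for multivariate stats if its
     * restriction estimator is eqsel; inequalities would be checked
     * against scalarltsel/scalargtsel the same way.
     */
    static bool
    mv_compatible_operator(Oid opno)
    {
        return (get_oprrest(opno) == F_EQSEL);
    }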

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
