Re: multivariate statistics / patch v7

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org, jeff(dot)janes(at)gmail(dot)com, sfrost(at)snowman(dot)net
Subject: Re: multivariate statistics / patch v7
Date: 2015-07-30 13:47:59
Message-ID: 55BA2B0F.5060209@2ndquadrant.com
Lists: pgsql-hackers

Hello,

On 07/30/2015 01:26 PM, Kyotaro HORIGUCHI wrote:
> Hello, I certainly attached the file this time.
>
>
> At Mon, 27 Jul 2015 23:54:08 +0200, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote in <55B6A880(dot)3050801(at)2ndquadrant(dot)com>
>
>> The bottom-up would work too, probably - I mean, we could start from
>> leaves of the expression tree, and build the largest "subtree"
>> compatible with multivariate stats and then try to estimate it. I
>> don't see how we could pass conditions though, which works naturally
>> in the top-down approach.
>
> By the way, the 'condition' seems to mean what is received by the
> clause(list)_selectivity parameter of the same name. But it is always
> NIL. Looking at the comment for collect_mv_attnums, it is prepared for
> 'multitable statistics'. If so, I think it's better removed from the
> current patch, because it is useless now.

I don't think so. Conditions are certainly not meant only for multitable
statistics (I don't see any comment suggesting that at
collect_mv_attnums), and they are actually used by the current code.

For example try this:

create table t (a int, b int, c int);
insert into t select i/100, i/100, i/100
from generate_series(1,100000) s(i);
alter table t add statistics (mcv) on (a,b);
analyze t;

select * from t where a<10 and b < 10 and (a < 50 or b < 50 or c < 50);

Here is what happens when estimating this query:

(1) clauselist_selectivity is called, and sees a list of three clauses:

(a<10)
(b<10)
(a<50 OR b<50 OR c<50)

But there's only a single statistics object, on columns [a,b], so at
this point we can process only the first two clauses. So we'll do
that, computing

P(a<10, b<10)

and we'll pass the OR-clause to the clause_selectivity() call, along
with the two already estimated clauses as conditions.

(2) clause_selectivity will receive (a<50 OR b<50 OR c<50) as the clause
to estimate, and the two clauses as conditions, computing

P(a<50 OR b<50 OR c<50 | a<10, b<10)
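
Putting the two steps together, the estimate for the whole WHERE clause
is just the chain rule for probabilities:

    P(a<10, b<10, (a<50 OR b<50 OR c<50))
      = P(a<10, b<10) * P(a<50 OR b<50 OR c<50 | a<10, b<10)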

The current estimate for the OR-clause is off, but I believe that's a
bug in the current implementation of clauselist_selectivity_or(), and
we've already discussed that some time ago.
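
You can check how far off it is by comparing the estimated and actual
row counts:

    explain analyze
    select * from t where a<10 and b<10 and (a<50 or b<50 or c<50);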

>
> The functional dependency code looks immature in both the detection
> phase and the application phase, in comparison to MCV and histograms.
> In addition to that, as the comment in dependencies.c says, fdep is
> not as significant (as MCV/HIST) because such dependencies are usually
> carefully avoided, and should be noticed and considered when designing
> the application or the whole system.

The code is certainly imperfect and needs improvements, no doubt about
that. I have spent much more time on MCV/histograms.

I'm not sure about stating that functional dependencies are less
significant than MCV/HIST (I don't see any such statement in
dependencies.c). I might have thought that initially, when I opted to
implement fdeps as the simplest possible type of statistics, but I think
they're quite practical, actually.

However, I disagree with the last point - it's true that databases are
often carefully normalized, which mostly makes functional dependencies
irrelevant. But that is mainly true for OLTP systems, while the primary
target of this patch is DSS/DWH systems. And in those systems,
denormalization is a very common practice.
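
To give a (completely made-up) example of the kind of schema I have in
mind - a denormalized table where one column determines another:

    -- hypothetical denormalized table, city functionally depends on zip
    create table orders (order_id int, zip text, city text);

    -- without knowledge of the dependency, the planner multiplies the
    -- selectivities of the two conditions and underestimates the result,
    -- even though the city condition is implied by the zip condition
    select * from orders where zip = '12345' and city = 'Albany';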

So I don't think fdeps are completely irrelevant - they're quite useful
in some scenarios, actually. Similar to the ndistinct coefficient stats
that you proposed, for example.

>
> Persisting in applying them all at once doesn't seem to be a good
> strategy to adopt this early.

Why?

>
> Or perhaps it might be better to register the dependency itself,
> rather than registering incomplete information (only the set of
> columns involved in the relationship) and trying to detect the
> relationship from the given values. I suppose those who can register
> the column set know the precise nature of the dependency in advance.

I don't see how that could be done. I mean, you only have the constants
supplied in the query - how could you verify the functional dependency
based on just those values (or even decide its direction)?
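
Consider the (made-up) example from above:

    -- at planning time all we see is the pair ('12345', 'Albany');
    -- a single pair cannot tell us whether zip determines city,
    -- city determines zip, both, or neither
    select * from orders where zip = '12345' and city = 'Albany';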

>>
>> What do you mean by "reconstruct the expression tree"? It's true I'm
>> walking the expression tree top-down, but how is that reconstructing?
>
> For example, clauselist_mv_split does. It separates mvclauses from the
> original clauselist, applies mv-stats at once, and (perhaps) lets the
> rest be processed via the 'normal' route. I called this
> "reconstruction", which I tried to do explicitly and separately.

Ah, I see. Thanks for the explanation. I wouldn't call this
"reconstruction" though - I merely need to track which clauses to
estimate using multivariate stats (and which need to be estimated using
the regular stats). That's pretty much what RestrictStatData does, no?

>>
>> I find your comments very valuable. I may not agree with some of
>> them, but I certainly appreciate your point of view. So thank you
>> very much for the time you spent reviewing this patch so far!
>
> Yeah, thank you for your patience and kindness.

Likewise. It's very frustrating trying to understand complex code
written by someone else, and I appreciate your effort.

>> Regarding the complexity - I am not too worried about spending
>> more CPU cycles on this, as long as it does not impact the case
>> where people have no multivariate statistics at all. That's because
>> I expect people to use this for large DSS/DWH data sets with lots
>> of dependencies in the (often denormalized) tables and complex
>> conditions - in those cases the planning difference is negligible,
>> especially if the improved estimates make the query run in seconds
>> instead of hours.
>
> I share the vision with you. If that is the case, the mv-stats route
> should not intrude on the existing non-mv-stats route. I feel you
> have intruded on clauselist_selectivity too much.
>
> If that is the case, my mv-distinct code has a different objective
> from yours. It aims to prevent the misestimation from multicolumn
> correlations that occur more commonly in OLTP usage.

OK. Let's see if we can make it work for both use cases.

>
>> This is why I was so careful to entirely skip the expensive
>> processing when there were no multivariate stats, and why I don't
>> like the fact that your approach makes this skip more difficult (or
>> maybe impossible, I'm not sure).
>
> My code skips it entirely if transformRestrictionForEstimate returns
> NULL, and runs clauselist_selectivity as usual. I think it's almost
> the same as yours.

Ah, OK. Perhaps I missed that as I've had trouble applying the patch.

>
> However, if you think so, I believe we should not only skip the
> calculation but also hide the additional code blocks, which overwhelm
> the normal route. That is one of the major objectives of my approach.

My main concern at this point was planning time, so skipping the
calculation should be enough, I believe. Hiding the additional code
blocks is a matter of aesthetics, and we can address that by moving the
code to a separate function or such.

>
> But sorry - I found that considering multiple stats at every level
> cannot be done without an exhaustive search of combinations among
> child clauses, and it needs an additional data structure. It needs
> more thought... As mentioned later, top-down might be more suitable
> for this optimization.

Do you think a combined approach - first bottom-up preprocessing, then
top-down optimization (using the results of the first phase to speed
things up) - might work?

>> Understood. As I explained above, I'm not all that concerned about
>> the performance impact, as long as we make sure it only applies to
>> people using the multivariate stats.
>>
>> I also think a combined approach - first a bottom-up step
>> (identifying the largest compatible subtrees & caching the varnos),
>> then a top-down step (doing the same optimization as implemented
>> today) might minimize the performance impact.
>
> I am almost reaching the same conclusion.

Ah, so the answer to my last question is "yes". Now we only need to
actually code it ;-)

kind regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
