Re: Shouldn't we have a way to avoid "risky" plans?

From: Nathan Boley <npboley(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Віталій Тимчишин <tivv00(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Shouldn't we have a way to avoid "risky" plans?
Date: 2011-03-24 20:30:42
Message-ID: AANLkTikqEHNGR6zwX8JJmrq4PEHi7HCQU-k_Vh0KKKSH@mail.gmail.com
Lists: pgsql-performance

>> This can be GUC-controllable, e.g. plan_safety=0..1 with a low default value.
>> This could influence the costs of plans whose cost changes dramatically with
>> small table changes and/or where the statistics are uncertain. It could also be
>> used as a direct "hint" for such dangerous queries by changing the GUC for a
>> session/single query.
>
> ISTM if you add statistics misses and a 'risk margin' to the things the
> planner would have to consider while generating a plan, you are
> greatly increasing the number of plan paths that would have to be
> considered for any non-trivial query.

FWIW, these ideas are well established in the statistical community.

Currently, we are essentially using "maximum likelihood estimators":
we estimate a bunch of selectivities by choosing the most likely values,
plug them into our objective function, and then minimize based upon the
plugged-in values. In a wide variety of cases, MLEs can be shown to be
"asymptotically" optimal. That is, as our sample distribution approaches
the true distribution, the best we can possibly do is to use the MLE.
This is pretty sensible - if we actually knew all of the selectivities,
the result wouldn't really be random anymore. However, MLEs often
perform very poorly with small sample sizes, particularly if the loss
function is very sensitive to relatively small fluctuations in the
parameter estimates ( which postgres certainly is - switching from a
hash join to a nestloop can be disastrous ).
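
To make that sensitivity concrete, here is a minimal sketch ( Python,
with made-up cost curves that are not postgres's actual cost model ) of
how a plug-in point estimate can flip the plan choice:

    # Toy cost curves -- purely illustrative, not the real planner's model.
    # Nestloop cost grows quickly with selectivity; hashjoin pays a fixed
    # setup cost but is much flatter.
    def cost_nestloop(sel):
        return 100 * sel

    def cost_hashjoin(sel):
        return 2 + 10 * sel

    def choose_plan(sel):
        # Current "plug-in" behavior: trust the point estimate completely.
        costs = {"nestloop": cost_nestloop(sel), "hashjoin": cost_hashjoin(sel)}
        return min(costs, key=costs.get)

    print(choose_plan(0.01))   # -> nestloop ( ~1.0 vs 2.1 )
    print(choose_plan(0.30))   # -> hashjoin; a modest estimation error flips the plan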

Using the estimate that minimizes the "worst-case" performance is
precisely a minimax estimator. There, the goal is to minimize the risk
function ( iow, plan cost ) under the worst possible conditions.
Wikipedia has a pretty good treatment - just think "plan cost"
whenever you see "risk".
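
In code the minimax rule is just "pick the plan whose worst cost over
the plausible selectivities is smallest". A self-contained sketch, using
the same toy cost curves as above ( again, made-up numbers ):

    def choose_plan_minimax(plans, candidate_sels):
        # plans: name -> cost function; candidate_sels: selectivities we
        # consider plausible given the statistics.  Minimize the worst case.
        worst = {name: max(cost(s) for s in candidate_sels)
                 for name, cost in plans.items()}
        return min(worst, key=worst.get)

    # Same toy cost curves as in the previous sketch.
    plans = {"nestloop": lambda s: 100 * s, "hashjoin": lambda s: 2 + 10 * s}
    print(choose_plan_minimax(plans, [0.001, 0.01, 0.1, 0.3]))   # -> hashjoin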

Another approach that hasn't been suggested yet is a Bayesian update
method. There, rather than calculating a specific parameter value
( like ndistinct ), you try to store the entire distribution and
choose the plan that minimizes cost averaged over all of the possible
parameter values.

Example ( please excuse the unrealistic numbers ):

Rather than estimating the selectivity of the join
( relation1.col1 = relation2.col1 ) to be 0.01, we would say it is 0.1
with probability 0.2 and 0.001 with probability 0.8. Here is how we
would choose the plan now:

cost( nestloop | selectivity = 0.01 ) = 1
cost( hashjoin | selectivity = 0.01 ) = 2
cost( mergejoin | selectivity = 0.01 ) = 50

Here is how the Bayesian approach would look:

cost( nestloop | selectivity = 0.001 ) = 0.1
cost( hashjoin | selectivity = 0.001 ) = 1
cost( mergejoin | selectivity = 0.001 ) = 50

cost( nestloop | selectivity = 0.1 ) = 10
cost( hashjoin | selectivity = 0.1 ) = 3
cost( mergejoin | selectivity = 0.1 ) = 50

So the Bayesian expected costs are:

nestloop: 0.1*0.8 + 10*0.2 = 2.08
hashjoin: 1.0*0.8 + 3*0.2 = 1.4
mergejoin: 50*0.8 + 50*0.2 = 50

so the hashjoin would be chosen.

For completeness, the minimax costs would be:

nestloop: max( 0.1, 10 ) = 10
hashjoin: max( 1, 3 ) = 3
mergejoin: max( 50, 50 ) = 50

So, again, the hashjoin is chosen.
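
For what it's worth, the arithmetic above is easy to check mechanically.
A minimal Python sketch using the made-up costs and probabilities from
the example:

    # Costs from the tables above, indexed by plan and selectivity.
    cost = {
        "nestloop":  {0.001: 0.1, 0.1: 10},
        "hashjoin":  {0.001: 1.0, 0.1: 3},
        "mergejoin": {0.001: 50,  0.1: 50},
    }
    prob = {0.001: 0.8, 0.1: 0.2}

    bayes   = {p: sum(cost[p][s] * w for s, w in prob.items()) for p in cost}
    minimax = {p: max(cost[p].values()) for p in cost}

    print(min(bayes, key=bayes.get))      # hashjoin ( ~1.4 vs ~2.08 and 50 )
    print(min(minimax, key=minimax.get))  # hashjoin ( 3 vs 10 and 50 )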

I obviously have a bias towards the Bayesian approach, but not because
I expect it to necessarily perform a whole lot better; rather, it's
because it subsumes the other two approaches. If we want the current
behavior, we simply store the MLE selectivity with probability 1. If we
want the minimax estimate, we put all of the probability on the worst
possible value. Or anything in between.
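
To spell that out with the toy numbers above: put all of the probability
mass on a single selectivity and the Bayesian choice collapses to either
the current plug-in choice or ( since 0.1 happens to be every plan's
worst case in this example ) the minimax choice:

    def choose_plan_bayes(cost, prob):
        # cost: plan -> { selectivity: cost }; prob: { selectivity: probability }
        expected = {pl: sum(c[s] * w for s, w in prob.items())
                    for pl, c in cost.items()}
        return min(expected, key=expected.get)

    # All probability on the point estimate ( selectivity 0.01, first table ):
    cost_mle = {"nestloop": {0.01: 1}, "hashjoin": {0.01: 2}, "mergejoin": {0.01: 50}}
    print(choose_plan_bayes(cost_mle, {0.01: 1.0}))    # nestloop -- current behavior

    # All probability on the worst-case selectivity ( 0.1, second table ):
    cost_worst = {"nestloop": {0.1: 10}, "hashjoin": {0.1: 3}, "mergejoin": {0.1: 50}}
    print(choose_plan_bayes(cost_worst, {0.1: 1.0}))   # hashjoin -- the minimax choice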

Also, ( not that I have even close to the experience / expertise to
make this claim - so take this with a grain of salt ) it seems that
the code changes would be substantial but pretty straightforward and
easy to localize. Rather than passing a single selectivity, pass a pair
of arrays with selectivities and probabilities. Initially, we could keep
the current estimates ( every passed array would be of length one ) and
then make changes as problems appear ( like Josh's ).
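
Purely as a sketch of that interface ( Python for brevity - the real
change would of course live in the C selectivity/costing code, and all
names here are made up ), the pair-of-arrays idea might look like:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class SelectivityDist:
        # Parallel arrays of possible selectivities and their probabilities.
        sels: List[float]
        probs: List[float]

        @classmethod
        def point(cls, sel):
            # Length-one arrays: exactly the current single-estimate behavior.
            return cls([sel], [1.0])

    def expected_plan_cost(cost_fn, dist):
        # cost_fn( selectivity ) -> cost of one candidate plan.
        return sum(w * cost_fn(s) for s, w in zip(dist.sels, dist.probs))

    # With a point distribution this is just today's plug-in cost ...
    print(expected_plan_cost(lambda s: 100 * s, SelectivityDist.point(0.01)))
    # ... and a two-point distribution gives the averaged cost ( ~2.08, the
    # nestloop expected cost from the example above ).
    print(expected_plan_cost(lambda s: 100 * s,
                             SelectivityDist([0.001, 0.1], [0.8, 0.2])))

Keeping every array at length one at first, as suggested above, would
make this a behavior-preserving refactor until richer distributions
actually start being stored.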

I hope my little estimation-procedure tutorial has been a little
helpful; please feel free to contact me off-list if you have questions
or want references.

Best,
Nathan Boley
