Re: Shouldn't we have a way to avoid "risky" plans?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Nathan Boley <npboley(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Віталій Тимчишин <tivv00(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Shouldn't we have a way to avoid "risky" plans?
Date: 2011-04-19 14:22:55
Message-ID: BANLkTinHaZygtwy9TfqjEmoRSHqQTxsfbw@mail.gmail.com
Lists: pgsql-performance

On Thu, Mar 24, 2011 at 4:30 PM, Nathan Boley <npboley(at)gmail(dot)com> wrote:
> Another approach, that hasn't been suggested yet, is some Bayesian
> update method. There, rather than calculating a specific parameter
> value ( like ndistinct ), you try to store the entire distribution and
> choose the plan that minimizes cost averaged over all of the possible
> parameter values.
>
> Example: ( please excuse the unrealistic numbers )
>
> For instance, rather than estimate the selectivity of the join (
> relation1.col1 = relation2.col1 ) to be 0.01, we would say it is 0.1
> w/ probability 0.2 and 0.001 with probability 0.8. So, here is how we
> would choose the plan now:
>
> cost( nestloop | selectivity = 0.01 ) = 1
> cost( hashjoin | selectivity = 0.01 ) = 2
> cost( mergejoin | selectivity = 0.01 ) = 50
>
> Here would be the bayesian approach:
>
> cost( nestloop | selectivity = 0.001 ) = 0.1
> cost( hashjoin | selectivity = 0.001 ) = 1
> cost( mergejoin | selectivity = 0.001 ) = 50
>
> cost( nestloop | selectivity = 0.1 ) = 10
> cost( hashjoin | selectivity = 0.1 ) = 3
> cost( mergejoin | selectivity = 0.1 ) = 50
>
> So, the bayesian costs are:
>
> nestloop: 0.1*0.8 + 10*0.2 = 2.08
> hashjoin: 1.0*0.8 + 3*0.2 = 1.4
> mergejoin: 50*0.8 + 50*0.2 = 50
>
> so the hashjoin would be chosen.
>
> For completeness, the minimax costs would be:
>
> nestloop: max( 0.1, 10 )
> hashjoin: max( 1, 3   )
> mergejoin: max( 50, 50 )
>
> So, again, the hashjoin is chosen.
>
> I obviously have a bias towards the Bayesian approach, but it's not
> because I expect it to necessarily perform a whole lot better but,
> rather, it reduces to the other two approaches. If we want the current
> behavior, then simply store the MLE selectivity with probability 1. If
> we want the minimax estimate, choose the worst possible value. Or
> anything in between.
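
Spelled out as a small sketch (purely illustrative, not planner code; the
PlanGuess struct is made up and the numbers simply mirror the example
above), the expected-cost and minimax comparisons look like this:

#include <stdio.h>

typedef struct
{
    const char *name;
    double      cost_lo;        /* cost if selectivity = 0.001 (prob 0.8) */
    double      cost_hi;        /* cost if selectivity = 0.1   (prob 0.2) */
} PlanGuess;

int
main(void)
{
    PlanGuess   plans[] = {
        {"nestloop", 0.1, 10.0},
        {"hashjoin", 1.0, 3.0},
        {"mergejoin", 50.0, 50.0}
    };
    double      p_lo = 0.8;
    double      p_hi = 0.2;
    int         i;

    for (i = 0; i < 3; i++)
    {
        double      expected = p_lo * plans[i].cost_lo + p_hi * plans[i].cost_hi;
        double      worst = (plans[i].cost_lo > plans[i].cost_hi) ?
            plans[i].cost_lo : plans[i].cost_hi;

        /* nestloop: 2.08 / 10, hashjoin: 1.4 / 3, mergejoin: 50 / 50 */
        printf("%-9s  expected=%.2f  worst=%.2f\n",
               plans[i].name, expected, worst);
    }
    return 0;
}

Under either rule the hashjoin wins here, and as noted above the
expected-cost rule degenerates to the current point-estimate behavior or
to minimax depending on what distribution you feed it.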

This is a very elegant suggestion for this problem, and I really like
it. It neatly models the concept we're trying to capture here: we're
sometimes just guessing at how things really are, and if it turns out
that we're way off, we may be stuck with a pathologically bad plan.

One limitation of this method is that it is difficult to apply more
than locally. Consider:

SELECT * FROM foo, bar WHERE foo.id = bar.id AND some_crazy_function(foo.id)

The best method of joining foo and bar is likely going to depend on
the percentage of rows in foo for which some_crazy_function(foo.id)
returns true, and we have no idea what that is. We could represent
that by kicking out a range of probabilistic *cost* estimates for each
path over foo, but that adds a lot of code complexity, and compute time
as well, because (I think) it would mean that more paths have to
percolate all the way up through the join tree.
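
To make that pruning issue concrete, here is a made-up sketch (the
CostRange type and range_dominates() are hypothetical, nothing like the
real Path machinery): with point costs the cheaper path always wins, but
with a cost range per path neither may dominate, so both survive and get
carried up the join tree.

#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    double      cost_min;       /* cost under the most favourable guess */
    double      cost_max;       /* cost under the least favourable guess */
} CostRange;

/* 'a' can replace 'b' only if it is at least as cheap under every guess */
static bool
range_dominates(CostRange a, CostRange b)
{
    return a.cost_max <= b.cost_min;
}

int
main(void)
{
    /* seqscan over foo: cost barely depends on some_crazy_function() */
    CostRange   seqscan = {100.0, 110.0};
    /* index path whose payoff depends entirely on the function's selectivity */
    CostRange   indexscan = {10.0, 500.0};

    if (!range_dominates(seqscan, indexscan) &&
        !range_dominates(indexscan, seqscan))
        printf("neither path dominates; both must be kept\n");
    return 0;
}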

It's perhaps also worth looking at our old nemesis:

SELECT * FROM foo WHERE a = 1 ORDER BY b LIMIT 1

What I really want to do here is have the planner be able to reflect
the fact that an index scan over b may be very expensive or very cheap
depending on how lucky we get applying the a = 1 predicate, but I'm
not quite sure how to make that work.
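
One way to see the problem in numbers, as a back-of-the-envelope sketch
(assuming a and b are uncorrelated, so the first a = 1 row shows up after
roughly 1/selectivity index tuples, and using a made-up distribution over
the a = 1 selectivity): the expected cost of the b-index plan is
completely dominated by the low-selectivity tail.

#include <stdio.h>

int
main(void)
{
    /* made-up distribution over the selectivity of a = 1 */
    double      sel[] = {0.1, 0.01, 0.0001};
    double      prob[] = {0.5, 0.4, 0.1};
    double      expected = 0.0;
    int         i;

    for (i = 0; i < 3; i++)
    {
        /* ~ number of b-index tuples visited before the first a = 1 hit */
        double      tuples_scanned = 1.0 / sel[i];

        printf("sel=%g -> ~%.0f index tuples scanned\n", sel[i], tuples_scanned);
        expected += prob[i] * tuples_scanned;
    }
    /* 0.5*10 + 0.4*100 + 0.1*10000 = 1045: the unlucky case swamps the average */
    printf("expected tuples scanned: %.0f\n", expected);
    return 0;
}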

It seems like the time when this would help the most without costing
too much or requiring excessively invasive surgery is the case where
the join selectivity itself is uncertain. We can estimate that fairly
accurately as far as MCVs go, but after that it does get very murky.
Still, my gut feeling is that many (not all) of the worst problems
actually bubble up from under the join, rather than happening at that
level.
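
For what it's worth, here is a toy illustration of why the MCV part of a
join selectivity estimate is fairly solid while the rest is murky (a
simplification of the idea, not the real eqjoinsel() logic): the
MCV-vs-MCV contribution comes from measured frequencies, whereas
everything outside the MCV lists has to be guessed.

#include <stdio.h>
#include <string.h>

typedef struct
{
    const char *value;
    double      freq;           /* fraction of the table holding this value */
} MCV;

int
main(void)
{
    MCV         mcv1[] = {{"a", 0.30}, {"b", 0.10}};
    MCV         mcv2[] = {{"a", 0.20}, {"c", 0.05}};
    double      matched = 0.0;
    int         i,
                j;

    /* selectivity contributed by values present in both MCV lists: measured */
    for (i = 0; i < 2; i++)
        for (j = 0; j < 2; j++)
            if (strcmp(mcv1[i].value, mcv2[j].value) == 0)
                matched += mcv1[i].freq * mcv2[j].freq;

    printf("MCV-vs-MCV selectivity (measured): %.4f\n", matched);
    /* everything outside the MCV lists is a guess, and that is the murky part */
    return 0;
}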

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
