Re: Tuning planner cost estimates

From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-performance(at)postgresql(dot)org, testperf-general(at)pgfoundry(dot)org
Subject: Re: Tuning planner cost estimates
Date: 2005-05-20 22:40:55
Message-ID: 20050520224055.GM44623@decibel.org
Lists: pgsql-performance

On Fri, May 20, 2005 at 04:47:38PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <decibel(at)decibel(dot)org> writes:
> > On Thu, May 19, 2005 at 09:31:47AM -0700, Josh Berkus wrote:
> >> can test our formula for accuracy and precision. However, such a formula
> >> *does* need to take into account concurrent activity, updates, etc ... that
> >> is, it needs to approximately estimate the relative cost on a live database,
> >> not a test one.
>
> > Well, that raises an interesting issue, because AFAIK none of the cost
> > estimate functions currently do that.
>
> I'm unconvinced that it'd be a good idea, either. People already
> complain that the planner's choices change when they ANALYZE; if the
> current load factor or something like that were to be taken into account
> then you'd *really* have a problem with irreproducible behavior.
>
> It might make sense to have something a bit more static, perhaps a GUC
> variable that says "plan on the assumption that there's X amount of
> concurrent activity". I'm not sure what scale to measure X on, nor
> exactly how this would factor into the estimates anyway --- but at least
> this approach would maintain reproducibility of behavior.

Or, allowing the load on the machine to affect query plans dynamically
could be something that's disabled by default; presumably, if you turn
it on, you know what you're doing.
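
For concreteness, here's one purely hypothetical way such a knob could
enter the math (the name assumed_concurrency and the even-split rule are
illustrative inventions, not anything that exists in the planner):
divide the cache the planner assumes is available among the expected
number of concurrent queries.

    def planner_cache_assumption(effective_cache_size_pages,
                                 assumed_concurrency=1):
        # Hypothetical: 131072 pages (1GB at 8kB) with assumed_concurrency
        # set to 10 would have each query plan as if only ~13k pages of
        # cache were available to it.
        return max(1, effective_cache_size_pages // assumed_concurrency)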

Of course this is all academic until we have a means to actually measure
how much system load affects the different things we estimate cost for,
and I don't see that happening until we have a system for measuring how
changing different input variables affects costs.

> > Another issue is: what state should the buffers/disk cache be in?
>
> The current cost models are all based on the assumption that every query
> starts from ground zero: nothing in cache. Which is pretty bogus in
> most real-world scenarios. We need to think about ways to tune that
> assumption, too. Maybe this is actually the same discussion, because
> certainly one of the main impacts of a concurrent environment is on what
> you can expect to find in cache.

Well, load doesn't directly affect cache efficiency; it's really a
question of the ratios of how often different things in the database are
accessed. If you wanted a crude idea of how likely pages from some
relation are to be in cache, you could take periodic snapshots of I/O
stats and see what percentage of the I/O done in a given period hit the
relation you're interested in, as compared to the rest of the database.
But I think this is probably still a second-order effect.
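
As a minimal sketch of that snapshot idea, assuming the standard
pg_statio_user_tables statistics view and a Python driver such as
psycopg2 (the function name and sampling window are just placeholders;
this is a back-of-the-envelope estimator, not something the planner
does):

    import time
    import psycopg2

    def io_share(conn, relname, interval_sec=60):
        """Fraction of tracked block reads in the window that hit relname."""
        def snapshot(cur):
            cur.execute("""
                SELECT relname,
                       heap_blks_read + COALESCE(idx_blks_read, 0)
                FROM pg_statio_user_tables
            """)
            return dict(cur.fetchall())
        cur = conn.cursor()
        before = snapshot(cur)
        time.sleep(interval_sec)          # sampling window
        after = snapshot(cur)
        deltas = {r: after[r] - before.get(r, 0) for r in after}
        total = sum(deltas.values())
        return deltas.get(relname, 0) / total if total else 0.0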

In terms of a testing system, here's what I'm thinking of. For each cost
estimate there will be a number of input variables we want to vary, and
we then check how changes in them affect run time. Using index scan as a
simple example, the first-order variables will likely be index and table
size (especially in relation to cache size) and correlation. So we need
some kind of test harness that can vary these variables (preferably one
at a time) and run a test case after each change. It would then need to
store the timing info, possibly along with other information (such as
EXPLAIN output). If I'm the one to write this it'll end up in Perl,
since that's the only language I know that would be able to accomplish
this. DBT seems a reasonable test database to do the testing with,
especially since it provides a ready means of generating external load.
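
Roughly the shape such a harness might take, sketched in Python for
brevity (the table, the parameter being varied, and the CSV output are
all placeholder assumptions):

    import csv
    import time
    import psycopg2

    # Placeholder test query; the predicate parameter stands in for one
    # input variable (effectively selectivity here).
    QUERY = "SELECT count(*) FROM accounts WHERE aid < %s"

    def harness(dsn, param_values, out_path="timings.csv"):
        conn = psycopg2.connect(dsn)
        cur = conn.cursor()
        with open(out_path, "w", newline="") as f:
            w = csv.writer(f)
            w.writerow(["param", "seconds", "plan"])
            for p in param_values:          # vary one variable at a time
                # Plain EXPLAIN records the plan without instrumentation.
                cur.execute("EXPLAIN " + QUERY, (p,))
                plan = "\n".join(r[0] for r in cur.fetchall())
                t0 = time.time()
                cur.execute(QUERY, (p,))    # count(*) discards rows server-side
                cur.fetchone()
                w.writerow([p, time.time() - t0, plan])
        conn.close()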

Does this sound like a reasonable approach? Also, how important do
people think it is to use EXPLAIN ANALYZE output instead of just doing
SELECT count(*) FROM (query you actually want to test)? (The count(*)
wrapper is just a means of throwing away the results, since we don't
really want to worry about data-transfer times, etc.) The testing I've
done (http://stats.distributed.net/~decibel/base.log) shows EXPLAIN
ANALYZE to be almost 5x slower than SELECT count(*), so avoiding it
would be a big win.
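
Side by side, the two measurement styles look like this (same
placeholder table as above); the EXPLAIN ANALYZE overhead comes from the
per-tuple timing calls it makes at every plan node:

    t0 = time.time()
    cur.execute("SELECT count(*) FROM (SELECT * FROM accounts) AS t")
    cur.fetchone()
    wrapper_secs = time.time() - t0    # one wall-clock number, low overhead

    cur.execute("EXPLAIN ANALYZE SELECT * FROM accounts")
    for (line,) in cur.fetchall():
        print(line)                    # per-node actual times, ~5x slower here
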
--
Jim C. Nasby, Database Consultant decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
