
Re: More thoughts about planner's cost estimates

From: Greg Stark <gsstark(at)mit(dot)edu>
To: josh(at)agliodbs(dot)com
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: More thoughts about planner's cost estimates
Date: 2006-06-01 19:14:17
Message-ID:
Lists: pgsql-hackers
Josh Berkus <josh(at)agliodbs(dot)com> writes:

> > However it will only make sense if people are willing to accept that
> > analyze will need a full table scan -- at least for tables where the DBA
> > knows that good n_distinct estimates are necessary.
> What about block-based sampling?  Sampling 1 in 20 disk pages, rather than
> 1 in 20 rows, should require significantly less scanning, and yet give us
> a large enough sample for reasonable accuracy.

a) We already use block-based sampling to reduce overhead. If you're talking
   about using the entire block and not just randomly sampled tuples from
   within those blocks then your sample will be biased.
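A toy simulation may make the bias in (a) concrete. This is illustrative Python only, not anything from the Postgres source: when values are physically clustered, keeping every tuple from a few sampled blocks sees far fewer distinct values than drawing the same number of random tuples across blocks.

```python
# Illustrative simulation (assumed setup, not PostgreSQL code):
# whole-block sampling is biased when tuples within a block correlate.
import random

random.seed(42)
BLOCKS, TUPLES_PER_BLOCK = 1000, 100
# Clustered table: each block holds tuples of mostly one value,
# as after a CLUSTER or an ordered bulk load. True n_distinct = 100.
table = [[b // 10] * TUPLES_PER_BLOCK for b in range(BLOCKS)]

SAMPLE_TUPLES = 2000  # same sample size for both strategies

# Strategy 1: pick whole blocks and keep every tuple in them.
picked = random.sample(range(BLOCKS), SAMPLE_TUPLES // TUPLES_PER_BLOCK)
whole_block_sample = [t for b in picked for t in table[b]]

# Strategy 2: pick a block, then one random tuple within it, repeatedly.
row_sample = [random.choice(random.choice(table)) for _ in range(SAMPLE_TUPLES)]

print(len(set(whole_block_sample)))  # at most 20 distinct values seen
print(len(set(row_sample)))          # close to all 100 distinct values
```

With the same 2000-tuple budget, the whole-block sample can never see more distinct values than the number of blocks it touched, while per-tuple sampling within blocks covers nearly the whole value domain.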

b) It *still* wouldn't give you reasonable accuracy for n_distinct estimates.
   Your probability of being accurate for discrete quantities like n_distinct
   is going to be more or less proportional to your sample fraction. So
   sampling 5% of the table gives you only about a 5% chance of an accurate
   prediction.
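A quick illustration of (b), again hypothetical Python rather than Postgres's actual estimator: two tables whose true n_distinct differs by a factor of two produce nearly identical 5% samples, so no estimator can reliably tell them apart from that sample.

```python
# Illustrative simulation: a 5% sample barely distinguishes a table of
# 100,000 distinct values from one with 50,000 (each value twice).
import random

random.seed(1)
N = 100_000
all_unique = list(range(N))             # true n_distinct = 100,000
pairs = [i // 2 for i in range(N)]      # true n_distinct = 50,000

k = int(N * 0.05)                       # 5% sample = 5,000 rows
d1 = len(set(random.sample(all_unique, k)))
d2 = len(set(random.sample(pairs, k)))
print(d1, d2)  # both near 5,000: the samples look almost the same
```

The observed distinct counts differ by only a couple of percent even though the true answers differ by 2x, which is the heart of why n_distinct estimation from small samples is so unreliable.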

> Actually, these  both seem like fairly straightforward problems 
> storage-wise.  The issue is deriving the statistics, for tables with many 
> columns or FKs.  

Well there are problems on many levels:

1) You have on the order of n^2 possible two-column combinations (n(n-1)/2
   pairs). That's a lot of processing and storage.
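The quadratic growth in (1) is easy to see with a few column counts (illustrative Python, using the standard combinatorial count):

```python
# Number of two-column pairs grows quadratically with column count.
from math import comb

for ncols in (10, 50, 100):
    print(ncols, comb(ncols, 2))  # 10 -> 45, 50 -> 1225, 100 -> 4950
```

A 100-column table already has nearly 5,000 column pairs, each of which would need its own cross-column statistics.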

2) It isn't even clear exactly what data you're looking for. Certainly
   "correlation" is just shorthand here and isn't what you're actually looking
   for. You need something like a complete histogram for the dependent column
   for every possible value of the forcing column. That would be an enormous
   amount of storage.
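Back-of-the-envelope arithmetic for (2), with assumed (not Postgres-derived) numbers, shows why a per-value histogram of the dependent column blows up:

```python
# Hypothetical storage estimate: one histogram of the dependent column
# per distinct value of the forcing column. All figures are assumptions.
n_forcing_values = 10_000    # distinct values in the forcing column
histogram_buckets = 100      # buckets per histogram
bytes_per_bucket = 8         # one 8-byte boundary value per bucket

total_bytes = n_forcing_values * histogram_buckets * bytes_per_bucket
print(total_bytes)  # 8,000,000 bytes for a single column pair
```

Eight megabytes of statistics for one column pair, multiplied across thousands of pairs, is why storing complete cross-column histograms is impractical.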

> OK ... but still, it should become a "little knob" rather than the "big 
> knob" it is currently.

Sure, the more accurately you model the cache the less need there would be to
adjust random_page_cost. I'm not sure how accurate Postgres can really get
unless it switches philosophies towards using O_DIRECT and doing its own
caching and scheduling. But it could certainly be much better than today.
Tom's comment makes it look like there's hope for pretty accurate cache
modelling for nested loop plans.
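A hedged sketch of the idea that better cache modelling shrinks the random_page_cost knob. The function and constants below are illustrative assumptions, not Postgres's actual cost model:

```python
# Sketch: blend the configured random-page cost with a near-free cached
# page cost, weighted by an estimated cache hit fraction. Illustrative
# only; names and numbers are assumptions, not Postgres internals.
def effective_page_cost(random_page_cost, cache_hit_fraction):
    CACHED_PAGE_COST = 0.01  # assumed cost of reading an already-cached page
    return (cache_hit_fraction * CACHED_PAGE_COST
            + (1.0 - cache_hit_fraction) * random_page_cost)

print(effective_page_cost(4.0, 0.0))   # cold cache: the full 4.0
print(effective_page_cost(4.0, 0.99))  # hot cache: a small fraction of 4.0
```

With a decent hit-fraction estimate, the same random_page_cost setting yields very different effective costs per plan, so the setting itself needs far less hand-tuning -- the "little knob" rather than the "big knob".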


pgsql-hackers by date

Next: From: Tom Lane, Date: 2006-06-01 19:15:09, Subject: Re: More thoughts about planner's cost estimates
Previous: From: Josh Berkus, Date: 2006-06-01 18:32:21, Subject: Re: More thoughts about planner's cost estimates
