Re: More thoughts about planner's cost estimates

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: More thoughts about planner's cost estimates
Date: 2006-06-01 18:25:56
Message-ID: 8764jkhgwb.fsf@stark.xeocode.com
Lists: pgsql-hackers


Josh Berkus <josh(at)agliodbs(dot)com> writes:

> 1. n-distinct estimation is bad, as previously discussed;
>
> 2. our current heuristics sampling methods prevent us from sampling more than
> 0.5% of any reasonably large table, causing all statistics on those tables to
> be off for any table with irregular distribution of values;

I'm convinced these two are more connected than you believe. For estimating how
values are distributed across a range, small samples rest on a solid statistical
basis. It's the same reason poll results based on only a few hundred people can
accurately predict elections in a country with hundreds of millions of voters.

However, for things like estimating the number of distinct values there's no
such solid foundation for small samples. Your accuracy is going to be simply
proportional to the sample size, meaning you can't get anything trustworthy
without sampling a large enough portion of the table that a full sequential
scan would be just as fast.
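
To make that concrete, here's a rough Python simulation -- the table size, the
skew, and the 0.5% sample fraction are made-up numbers for illustration, not
anything ANALYZE actually does. The sample median lands close to the truth,
while the distinct count from the same sample is nowhere near it:

import random

random.seed(0)
N_ROWS = 1_000_000
SAMPLE_FRACTION = 0.005   # roughly what we can sample on a big table today

# Skewed column: a handful of very common values plus a long tail of rare ones.
table = [random.randint(1, 100) if random.random() < 0.9
         else random.randint(101, 200_000)
         for _ in range(N_ROWS)]

sample = random.sample(table, int(N_ROWS * SAMPLE_FRACTION))

# Range/quantile estimation: the sample median tracks the real median closely.
true_median = sorted(table)[N_ROWS // 2]
sample_median = sorted(sample)[len(sample) // 2]

# Distinct-value estimation: the raw sample count misses nearly all the rare
# values, and scaling it up by the sampling fraction guesses wrong in the
# other direction; which way and how far depends on the duplication pattern,
# which the sample itself can't tell you.
true_nd = len(set(table))
sample_nd = len(set(sample))
scaled_nd = sample_nd / SAMPLE_FRACTION

print(f"median:     true {true_median}, from sample {sample_median}")
print(f"n_distinct: true {true_nd}, raw sample {sample_nd}, scaled {scaled_nd:.0f}")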

I might be interested in implementing the algorithm that was posted a while
back that involved generating good unbiased samples of distinct values. The
algorithm was quite clever and well described, and the paper claimed
impressively good results.

However it will only make sense if people are willing to accept that analyze
will need a full table scan -- at least for tables where the DBA knows that
good n_distinct estimates are necessary.
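
I won't try to reproduce the paper's estimator from memory here, but just to
illustrate what a full-scan analyze buys you, the dumb exact version is a
single pass with a hash set per column. This is only a sketch, not the posted
algorithm, and fetch_all_rows() is a stand-in for however the scan is driven:

def exact_ndistinct(rows, columns):
    # One full pass over the heap, one hash set per column.  Memory grows with
    # the number of distinct values, which is the price of being exact; the
    # posted algorithm bounds that with a smarter sampling scheme instead.
    seen = {col: set() for col in columns}
    for row in rows:
        for col in columns:
            seen[col].add(row[col])
    return {col: len(vals) for col, vals in seen.items()}

# Hypothetical usage, with fetch_all_rows() yielding one dict per heap tuple:
#   stats = exact_ndistinct(fetch_all_rows("orders"), ["customer_id", "status"])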

> 3. We don't have any method to analyze inter-column correlation within a table;
>
> 4. We don't keep statistics on foreign key correlation;

Gosh these would be nice but they sound like hard problems. Has anybody even
made any headway in brainstorming how to tackle them?

> 5. random_page_cost (as previously discussed) is actually a function of
> relatively immutable hardware statistics, and as such should not need to exist
> as a GUC once the cost model is fixed.

I don't think that's true at all. Not all hardware is the same.

Certainly the need to twiddle this GUC should be drastically reduced if the
cache effects are modelled properly and the only excuses left are legitimate
hardware differences.
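
Just to illustrate that the number really is a property of the drive, a
back-of-the-envelope timing like the following comes out wildly different on
different hardware. The file path is a placeholder, and the file needs to be
much larger than RAM (or the cache dropped), since cached reads hide the seeks:

import os, random, time

PAGE = 8192
PATH = "/tmp/bigfile"     # placeholder: a multi-GB file on the disk under test
SAMPLES = 2000

npages = os.path.getsize(PATH) // PAGE
fd = os.open(PATH, os.O_RDONLY)

start = time.perf_counter()
for i in range(SAMPLES):                      # consecutive 8kB pages
    os.pread(fd, PAGE, i * PAGE)
seq = (time.perf_counter() - start) / SAMPLES

start = time.perf_counter()
for _ in range(SAMPLES):                      # uniformly scattered 8kB pages
    os.pread(fd, PAGE, random.randrange(npages) * PAGE)
rnd = (time.perf_counter() - start) / SAMPLES

os.close(fd)
print(f"sequential {seq*1e6:.0f} us/page, random {rnd*1e6:.0f} us/page, "
      f"ratio {rnd/seq:.1f}")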

--
greg
