On Thu, Jun 01, 2006 at 02:25:56PM -0400, Greg Stark wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
> > 1. n-distinct estimation is bad, as previously discussed;
> > 2. our current heuristics sampling methods prevent us from sampling more than
> > 0.5% of any reasonably large table, causing all statistics on those tables to
> > be off for any table with irregular distribution of values;
> I'm convinced these two are more connected than you believe. For distributions
> of values in a range using small samples is on solid statistical basis. It's
> the same reason poll results based on only a few hundred people can accurately
> predict elections in a country with hundreds of millions of voters.
> However for things like estimating discrete values there's absolutely no solid
> foundation for these small samples. Your accuracy is going to be simply
> proportional to the sample size, meaning you can't get anything trustworthy
> without sampling large enough portions of the table that a full sequential
> scan would be just as fast.
> I might be interested in implementing that algorithm that was posted a while
> back that involved generating good unbiased samples of discrete values. The
> algorithm was quite clever and well described and paper claimed impressively
> good results.
> However it will only make sense if people are willing to accept that analyze
> will need a full table scan -- at least for tables where the DBA knows that
> good n_distinct estimates are necessary.
There *might* be an alternative....
Suppose that the executor could keep track of what values it's seeing
from a table as it's actually running a query. These could be used to
build statistical information without actually running analyze.
Of course, the problem is that you could end up with some seriously
skewed statistics, depending on what your query patterns are. But in
some ways that's not bad; you're optimizing for the queries you most
One possible way to handle this would be to keep track of how many times
each value has been seen, as well as when it was last seen, so you have
some idea of the quality of the data. Another idea is to somehow blend
these stats with the traditional method.
> > 3. We don't have any method to analyze inter-column correlation within a table;
> > 4. We don't keep statistics on foriegn key correlation;
> Gosh these would be nice but they sound like hard problems. Has anybody even
> made any headway in brainstorming how to tackle them?
A *huge* improvement would be gathering statistics on all multi-column
indexes. Some of the stats might not make sense in this context, but
others (such as correlation) would.
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
In response to
pgsql-hackers by date
|Next:||From: Jim C. Nasby||Date: 2006-06-01 23:22:12|
|Subject: Re: More thoughts about planner's cost estimates|
|Previous:||From: Joshua D. Drake||Date: 2006-06-01 23:04:46|
|Subject: Re: "CVS-Unknown" buildfarm failures?|