On Thu, Jun 01, 2006 at 08:36:16PM -0400, Greg Stark wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
> > Greg, Tom,
> > > a) We already use block based sampling to reduce overhead. If
> > > you're talking about using the entire block and not just
> > > randomly sampled tuples from within those blocks then your
> > > sample will be biased.
> > There are actually some really good equations to work with this,
> > estimating both row correlation and n-distinct based on sampling
> > complete blocks. See prior discussions on this list on
> > N-distinct.
> In the prior discussions someone posted the paper with the algorithm
> I mentioned. That paper mentions that previous work showed poor
> results at estimating n_distinct even with sample sizes as large as
> 50% or more.
Which paper? People have referenced several different ones.
> > > 1) You have n^2 possible two-column combinations. That's a lot
> > > of processing and storage.
> > Yes, that's the hard problem to solve. Actually, btw, it's n!,
> > not n^2.
> You have O(n^2) possible *two-column* combinations and O(n!)
> n-column combinations. I was hoping two-column combinations would
> be enough information to extrapolate from for larger combinations.
The math nerd in me says that this is bad math, but it might work
"well enough" by ass-u-ming a lack of higher-order correllations.
> > > 2) It isn't even clear what data you're exactly looking for.
> > > Certainly "correlation" is just shorthand here and isn't what
> > > you're actually looking for.
> > Actually, I'd think that a correlation number estimate (0 =
> > complete uncorrelated, 1 = completely correlated) would be
> > sufficient to improve row count estimation significantly, without
> > incurring the vast overhead of histogramXhistorgram manipulation.
> No, correlation is actually quite uninteresting here. Consider for
> example two columns "state" and "area code". The "area code" is
> pretty much 100% dependent on state, but will show very little
There are quantitative tests of independence available. I'm not sure
whether any of these have been applied even theoretically in
> Hopefully you're right that you don't need complete histograms.
> Perhaps there's some statistics concept they don't teach in stats
> 101 that would cover this need. What we're looking for is a
> function f(a,b,x) that gives the net selectivity given a and b, the
> selectivity of two clauses based on two columns, and x some simple
> value that can be easily calculated by looking at the data in
That would be neat :)
David Fetter <david(at)fetter(dot)org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Remember to vote!
In response to
pgsql-hackers by date
|Next:||From: Christopher Kings-Lynne||Date: 2006-06-02 01:38:54|
|Subject: Re: [PATCH] Magic block for modules|
|Previous:||From: Greg Stark||Date: 2006-06-02 00:36:16|
|Subject: Re: More thoughts about planner's cost estimates|