
Re: More thoughts about planner's cost estimates

From: David Fetter <david(at)fetter(dot)org>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: josh(at)agliodbs(dot)com, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: More thoughts about planner's cost estimates
Date: 2006-06-02 01:04:12
Message-ID: 20060602010412.GI10906@fetter.org
Lists: pgsql-hackers
On Thu, Jun 01, 2006 at 08:36:16PM -0400, Greg Stark wrote:
> 
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
> 
> > Greg, Tom,
> > 
> > > a) We already use block based sampling to reduce overhead.  If
> > > you're talking about using the entire block and not just
> > > randomly sampled tuples from within those blocks then your
> > > sample will be biased.
> > 
> > There are actually some really good equations to work with this,
> > estimating both row correlation and n-distinct based on sampling
> > complete blocks.  See prior discussions on this list on
> > N-distinct.
> 
> In the prior discussions someone posted the paper with the algorithm
> I mentioned.  That paper mentions that previous work showed poor
> results at estimating n_distinct even with sample sizes as large as
> 50% or more.

Which paper?  People have referenced several different ones.
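
For intuition on the sampling bias Greg describes, here is a toy simulation of mine (not anything ANALYZE actually does): if values are physically clustered so that each block holds a single value, sampling whole blocks collapses the distinct-value count, while uniform row sampling of the same size does not.

```python
import random

random.seed(42)

# Worst case for block sampling: values cluster physically, so each
# block holds copies of exactly one value.
BLOCKS, ROWS_PER_BLOCK = 1000, 100
table = [[b] * ROWS_PER_BLOCK for b in range(BLOCKS)]  # 1000 distinct values

def block_sample(nblocks):
    """Take every tuple from nblocks randomly chosen blocks."""
    chosen = random.sample(range(BLOCKS), nblocks)
    return [v for b in chosen for v in table[b]]

def row_sample(nrows):
    """Take nrows tuples uniformly at random from the whole table."""
    idx = random.sample(range(BLOCKS * ROWS_PER_BLOCK), nrows)
    return [table[i // ROWS_PER_BLOCK][i % ROWS_PER_BLOCK] for i in idx]

# Same sample size (1000 tuples), very different pictures of n_distinct:
bs = block_sample(10)   # only 10 distinct values ever visible
rs = row_sample(1000)   # hits hundreds of distinct blocks
```

Under this clustering, the block sample can never report more distinct values than the number of blocks read, no matter how many tuples it contains.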

> > > 1) You have n^2 possible two-column combinations.  That's a lot
> > > of processing and storage.
> > 
> > Yes, that's the hard problem to solve.  Actually, btw, it's n!,
> > not n^2.
> 
> You have O(n^2) possible *two-column* combinations and O(n!)
> n-column combinations.  I was hoping two-column combinations would
> be enough information to extrapolate from for larger combinations.

The math nerd in me says that this is bad math, but it might work
"well enough" by ass-u-ming a lack of higher-order correlations.

> > > 2) It isn't even clear what data you're exactly looking for.
> > > Certainly "correlation" is just shorthand here and isn't what
> > > you're actually looking for. 
> > 
> > Actually, I'd think that a correlation number estimate (0 =
> > complete uncorrelated, 1 = completely correlated) would be
> > sufficient to improve row count estimation significantly, without
> > incurring the vast overhead of histogram × histogram manipulation.
> 
> No, correlation is actually quite uninteresting here.  Consider for
> example two columns "state" and "area code".  The "area code" is
> pretty much 100% dependent on state, but will show very little
> correlation.

There are quantitative tests of independence available.  I'm not sure
whether any of these have been applied even theoretically in
DBMS-land.
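
Greg's state/area-code point can be made concrete with a chi-squared test of independence. This is a pure-Python toy of mine, with made-up codes chosen so that the linear (Pearson) correlation is exactly zero even though the area code is a function of the state:

```python
from collections import Counter
from math import sqrt

# state -> area code; the code is fully determined by the state, but the
# numeric values are arranged so there is no linear trend at all.
rows = [(1, 300), (1, 300), (2, 100), (2, 100), (3, 500), (3, 500),
        (4, 100), (4, 100), (5, 300), (5, 300)]

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def chi_squared(pairs):
    """Chi-squared statistic against the independence hypothesis."""
    n = len(pairs)
    rc = Counter(a for a, _ in pairs)   # row marginals
    cc = Counter(b for _, b in pairs)   # column marginals
    oc = Counter(pairs)                 # observed cell counts
    return sum((oc[(a, b)] - rc[a] * cc[b] / n) ** 2 / (rc[a] * cc[b] / n)
               for a in rc for b in cc)

r  = pearson([s for s, _ in rows], [c for _, c in rows])  # 0.0: no linear signal
x2 = chi_squared(rows)                                    # large: strong dependence
```

So a single "correlation" number would call these columns unrelated, while the independence test correctly rejects independence.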

> Hopefully you're right that you don't need complete histograms.
> Perhaps there's some statistics concept they don't teach in stats
> 101 that would cover this need.  What we're looking for is a
> function f(a,b,x) that gives the net selectivity given a and b, the
> selectivity of two clauses based on two columns, and x some simple
> value that can be easily calculated by looking at the data in
> advance.

That would be neat :)
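
One naive candidate for such an f — purely speculative on my part, not anything the planner implements — would treat x as a dependence score in [0, 1] and interpolate between the independence estimate a*b and the fully-dependent estimate min(a, b):

```python
def combined_selectivity(a, b, x):
    """Blend the independence estimate a*b with the fully-dependent
    estimate min(a, b), using x in [0, 1] as a dependence score.
    x = 0 -> columns independent; x = 1 -> one column determines the other.
    A naive sketch only; real dependence is rarely a single scalar."""
    return (1 - x) * (a * b) + x * min(a, b)

# Two clauses, each 10% and 20% selective:
#   independent columns -> 2% combined; fully dependent -> 10% combined.
lo = combined_selectivity(0.1, 0.2, 0.0)
hi = combined_selectivity(0.1, 0.2, 1.0)
```

The hard part, of course, is that precomputed x in the first place.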

Cheers,
D
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666
                              Skype: davidfetter

Remember to vote!
