Re: More thoughts about planner's cost estimates

From: David Fetter <david(at)fetter(dot)org>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: josh(at)agliodbs(dot)com, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: More thoughts about planner's cost estimates
Date: 2006-06-02 01:04:12
Message-ID: 20060602010412.GI10906@fetter.org
Lists: pgsql-hackers

On Thu, Jun 01, 2006 at 08:36:16PM -0400, Greg Stark wrote:
>
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
>
> > Greg, Tom,
> >
> > > a) We already use block based sampling to reduce overhead. If
> > > you're talking about using the entire block and not just
> > > randomly sampled tuples from within those blocks then your
> > > sample will be biased.
> >
> > There are actually some really good equations to work with this,
> > estimating both row correlation and n-distinct based on sampling
> > complete blocks. See prior discussions on this list on
> > N-distinct.
>
> In the prior discussions someone posted the paper with the algorithm
> I mentioned. That paper mentions that previous work showed poor
> results at estimating n_distinct even with sample sizes as large as
> 50% or more.

Which paper? People have referenced several different ones.

> > > 1) You have n^2 possible two-column combinations. That's a lot
> > > of processing and storage.
> >
> > Yes, that's the hard problem to solve. Actually, btw, it's n!,
> > not n^2.
>
> You have O(n^2) possible *two-column* combinations and O(n!)
> n-column combinations. I was hoping two-column combinations would
> be enough information to extrapolate from for larger combinations.
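To make the counts above concrete, here is a small sketch (not from the thread itself) of how the number of combinations grows: distinct two-column pairs grow only quadratically, C(n, 2) = n*(n-1)/2, while the number of multi-column subsets grows exponentially, which is why tracking only pairwise statistics is attractive.

```python
from math import comb

def pair_count(n):
    # Number of distinct two-column combinations: C(n, 2) = n*(n-1)/2
    return comb(n, 2)

def multi_column_subset_count(n):
    # Number of column subsets of size 2 or more: 2^n - n - 1
    # (all subsets, minus the n singletons and the empty set)
    return 2 ** n - n - 1

for n in (10, 50, 100):
    print(n, pair_count(n), multi_column_subset_count(n))
```

Even at 100 columns there are only 4950 pairs, while the full subset count is already astronomically large.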

The math nerd in me says that this is bad math, but it might work
"well enough" by ass-u-ming a lack of higher-order correlations.

> > > 2) It isn't even clear what data you're exactly looking for.
> > > Certainly "correlation" is just shorthand here and isn't what
> > > you're actually looking for.
> >
> > Actually, I'd think that a correlation number estimate (0 =
> > completely uncorrelated, 1 = completely correlated) would be
> > sufficient to improve row count estimation significantly, without
> > incurring the vast overhead of histogram X histogram manipulation.
>
> No, correlation is actually quite uninteresting here. Consider for
> example two columns "state" and "area code". The "area code" is
> pretty much 100% dependent on state, but will show very little
> correlation.

There are quantitative tests of independence available. I'm not sure
whether any of these have been applied even theoretically in
DBMS-land.

> Hopefully you're right that you don't need complete histograms.
> Perhaps there's some statistics concept they don't teach in stats
> 101 that would cover this need. What we're looking for is a
> function f(a,b,x) that gives the net selectivity given a and b, the
> selectivity of two clauses based on two columns, and x some simple
> value that can be easily calculated by looking at the data in
> advance.

That would be neat :)
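For what it's worth, one hypothetical shape such an f(a,b,x) could take (purely my speculation, with x as an assumed per-column-pair dependence score in [0,1] gathered in advance) is a linear interpolation between the independence estimate a*b and the fully-dependent estimate min(a,b):

```python
def combined_selectivity(a, b, x):
    """Hypothetical combiner: x = 0 reduces to the independence
    assumption (a * b); x = 1 clamps to the more selective clause
    (min(a, b)).  'x' is an assumed precomputed dependence score;
    nothing like this function exists in PostgreSQL as discussed."""
    assert 0.0 <= x <= 1.0
    return (1.0 - x) * (a * b) + x * min(a, b)

# Independent columns: the estimate is just the product of selectivities.
print(combined_selectivity(0.1, 0.2, 0.0))
# Fully dependent columns (e.g. state/area code): the joint selectivity
# collapses to that of the more selective clause alone.
print(combined_selectivity(0.1, 0.2, 1.0))
```

Whether a single scalar x can capture enough of the joint distribution to beat the independence assumption in practice is exactly the open question.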

Cheers,
D
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Skype: davidfetter

Remember to vote!
