Re: More thoughts about planner's cost estimates

From: Greg Stark <gsstark(at)mit(dot)edu>
To: josh(at)agliodbs(dot)com
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: More thoughts about planner's cost estimates
Date: 2006-06-02 00:36:16
Message-ID: 87lksgfl6n.fsf@stark.xeocode.com
Lists: pgsql-hackers


Josh Berkus <josh(at)agliodbs(dot)com> writes:

> Greg, Tom,
>
> > a) We already use block based sampling to reduce overhead. If you're
> > talking about using the entire block and not just randomly sampled
> > tuples from within those blocks then your sample will be biased.
>
> There are actually some really good equations to work with this, estimating
> both row correlation and n-distinct based on sampling complete blocks.
> See prior discussions on this list on N-distinct.

In the prior discussions someone posted the paper with the algorithm I
mentioned. That paper notes that previous work produced poor n_distinct
estimates even with sample sizes of 50% or more.
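
To make that concrete, here's a tiny simulation (made-up data, and not the
paper's estimator) showing how badly the naive "distinct values seen in the
sample" count undershoots on a skewed column, even at large sample fractions:

    # Toy illustration: naive sample-based n_distinct on skewed data.
    # The column shape and sizes are invented for the demonstration.
    import random

    random.seed(1)

    # One very common value plus 9,999 rare values appearing twice each.
    column = [0] * 500_000
    column += [v for v in range(1, 10_000) for _ in range(2)]

    true_ndistinct = len(set(column))       # 10,000

    for fraction in (0.01, 0.10, 0.50):
        sample = random.sample(column, int(len(column) * fraction))
        seen = len(set(sample))
        print(f"{fraction:>4.0%} sample: {seen:>5} of {true_ndistinct} distinct values seen")

Even the 50% sample misses roughly a quarter of the rare values, and any
estimator then has to guess how much to scale that count up.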

> > 1) You have n^2 possible two-column combinations. That's a lot of
> > processing and storage.
>
> Yes, that's the hard problem to solve. Actually, btw, it's n!, not n^2.

You have O(n^2) possible *two-column* combinations and O(n!) n-column
combinations. I was hoping statistics on two-column combinations would be
enough information to extrapolate to the larger combinations.
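
For a rough sense of scale (column counts picked arbitrarily), here's a
trivial sketch of how the pair count grows:

    # Number of two-column combinations for tables of various widths.
    from math import comb   # Python 3.8+

    for ncols in (10, 20, 50, 100):
        print(f"{ncols:>3} columns -> {comb(ncols, 2):>4} column pairs")

So storing one number per pair is at least plausible for ordinary table
widths, whereas covering every larger combination blows up much faster.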

> > 2) It isn't even clear what data you're exactly looking for. Certainly
> > "correlation" is just shorthand here and isn't what you're actually
> > looking for.
>
> Actually, I'd think that a correlation number estimate (0 = completely
> uncorrelated, 1 = completely correlated) would be sufficient to improve
> row count estimation significantly, without incurring the vast overhead of
> histogram x histogram manipulation.

No, correlation is actually quite uninteresting here. Consider for example two
columns "state" and "area code". The "area code" is pretty much 100% dependent
on state, but will show very little correlation. Similarly things like
"product name" and "catalog number" or even "author" and "genre" would be
problem cases but have little correlation. And you can easily imagine queries
that put restrictive clauses on two such columns, for which the existing
statistics will overestimate how selective the combination is (and so
underestimate the row count) because they assume no interdependency.
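
To put numbers on the state / area code example (all figures invented), the
multiply-the-selectivities estimate the planner uses today can be off by an
order of magnitude:

    # Hypothetical table: 1M rows, 12% in CA, 3% with area code 415,
    # and every 415 row is also a CA row.
    total_rows   = 1_000_000
    sel_state    = 0.12          # WHERE state = 'CA'
    sel_areacode = 0.03          # WHERE area_code = '415'

    # Independence assumption (what the planner does now):
    estimated_rows = sel_state * sel_areacode * total_rows   # 3,600

    # Reality: area code 415 already implies state = 'CA':
    actual_rows = sel_areacode * total_rows                  # 30,000

    print(f"estimated: {estimated_rows:,.0f} rows, actual: {actual_rows:,.0f} rows")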

Hopefully you're right that you don't need complete histograms. Perhaps
there's some statistics concept they don't teach in stats 101 that would cover
this need. What we're looking for is a function f(a,b,x) that gives the net
selectivity, where a and b are the selectivities of the two clauses on the two
columns, and x is some simple value that can be cheaply calculated by looking
at the data in advance.
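
Purely as a sketch of the shape such a function might take (not a proposal
for the actual statistic), x could interpolate between the two extremes we
already know how to compute, full independence and full functional
dependence:

    def combined_selectivity(a: float, b: float, x: float) -> float:
        # x = 0: assume independence (what the planner does today).
        # x = 1: assume one column fully determines the other.
        independent = a * b
        dependent   = min(a, b)
        return (1.0 - x) * independent + x * dependent

    # state = 'CA' (0.12) and area_code = '415' (0.03):
    print(combined_selectivity(0.12, 0.03, 0.0))   # 0.0036
    print(combined_selectivity(0.12, 0.03, 1.0))   # 0.03

The hard part, of course, is finding an x that can be computed cheaply while
scanning the data and that makes the in-between cases behave sensibly.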

--
greg
