Skip site navigation (1) Skip section navigation (2)

Re: More thoughts about planner's cost estimates

From: Greg Stark <gsstark(at)mit(dot)edu>
To: David Fetter <david(at)fetter(dot)org>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, josh(at)agliodbs(dot)com, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: More thoughts about planner's cost estimates
Date: 2006-06-02 17:08:13
Message-ID: 87ac8vfptu.fsf@stark.xeocode.com (view raw or flat)
Thread:
Lists: pgsql-hackers
David Fetter <david(at)fetter(dot)org> writes:

> > In the prior discussions someone posted the paper with the algorithm
> > I mentioned.  That paper mentions that previous work showed poor
> > results at estimating n_distinct even with sample sizes as large as
> > 50% or more.
> 
> Which paper?  People have referenced several different ones.

Phillip B Gibbons. Distinct Sampling for Highly-Accurate Answers to Distinct
Values Queries and Event Reports. In Proceedings of the 27th VLDB Conference,
Roma, Italy, 2001.

Which says (emphasis in original as italics):

    The most well-studied approach for distinct-values estimation is to
    collect a uniform random sample S of the data, store S in a database, and
    then use S at query time to provide fast, approximate answers to distinct
    values queries [22, 23, 27, 21, 29, 5, 28, 18, 19, 9, 7]. However,
    previous work [28, 18, 9, 7] has shown powerful negative results on the
    quality of distinct-values estimates based on sampling (or other
    techniques that examine only part of the input data), even for the simple
    case of counting the number of distinct values in a column. The strongest
    negative result is due to Charikar et al. [7], who proved that estimating
    the number of distinct values in a column to within a small constant
    factor (with probability > 1/2) requires that *nearly* *the* *entire*
    *data* *set* *be* *sampled*. Moreover, all known sampling-based estimators
    provide unsatisfactory results on data sets of interest [7], even for this
    simple case.

And later:

    Using a variety of synthetic and real-world data sets, we show that
    distinct sampling gives estimates for distinct values queries that are
    within 0%-10%, whereas previous methods were typically 50%-250% off,
    across the spectrum of data sets and queries studied.

Here "distinct sampling" is the algorithm being proposed which requires
looking at every record and keeping a sample *of the distinct values*. The
"previous methods" are methods based on sampling the records

I haven't read the citation [7] that proves the strong negative result for any
estimator of distinct values based on sampling. It's:

  M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. Towards estimation
  error guarantees for distinct values. In Proc. 19th ACM Symp. on Principles
  of Database Systems, pages 268?279, May 2000.


> > Hopefully you're right that you don't need complete histograms.
> > Perhaps there's some statistics concept they don't teach in stats
> > 101 that would cover this need.  What we're looking for is a
> > function f(a,b,x) that gives the net selectivity given a and b, the
> > selectivity of two clauses based on two columns, and x some simple
> > value that can be easily calculated by looking at the data in
> > advance.
> 
> That would be neat :)

Doing a bit of basic searching around I think the tool we're looking for here
is called a "chi-squared test for independence".

I haven't read up on how it works so I'm unclear if i it be calculated using a
simple O(n) scan or if it would require some non-linear post-processing after
the analyze pass which would be unfortunate.

And I haven't found anything that describes how to make use of the resulting
number. Does it actually give a formula f(a,b,x) that gives a nice convenient
expected selectivity for the clause?

Oh, and incidentally, at first glance it seems like calculating it depends on
having good distinct value sampling.

-- 
greg


In response to

Responses

pgsql-hackers by date

Next:From: Larry RosenmanDate: 2006-06-02 17:20:33
Subject: Re: Going for 'all green' buildfarm results
Previous:From: Alvaro HerreraDate: 2006-06-02 16:28:53
Subject: Re: COPY (query) TO file

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group