From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "jd(at)commandprompt(dot)com" <jd(at)commandprompt(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: benchmarking the query planner
Date: 2008-12-12 14:59:35
Message-ID: 1229093975.13078.275.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Fri, 2008-12-12 at 06:44 -0500, Robert Haas wrote:
> On Fri, Dec 12, 2008 at 4:04 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> >> The existing sampling mechanism is tied to solid statistics. It
> >> provides the correct sample size to get a consistent confidence range
> >> for range queries. This is the same mathematics which governs election
> >> polling and other surveys. The sample size you need to get +/- 5% 19
> >> times out of 20 increases as the population increases, but not by very
> >> much.
> >
> > Sounds great, but its not true. The sample size is not linked to data
> > volume, so how can it possibly give a consistent confidence range?
>
> I'm not 100% sure how relevant it is to this case, but I think what
> Greg is referring to is:
>
> http://en.wikipedia.org/wiki/Margin_of_error#Effect_of_population_size
>
> It is a pretty well-known mathematical fact that for something like an
> opinion poll your margin of error does not depend on the size of the
> population but only on the size of your sample.

Yes, I agree with that *but* that refers to population statistics and
has nothing at all to do with the calculation of ndistinct, which is
what we were discussing. You can't just switch topics and have the
statement remain true.

ndistinct estimation is improved by larger sample sizes; that is what the
maths says, and what experimentation shows as well.
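To make the contrast concrete, here is a small illustrative sketch (my own Python, not PostgreSQL code; the 3,000-row sample mirrors 300 * the default stats_target of 10, and the "~100 duplicates per value" data shape is an assumption for the example). With the sample size held fixed, a proportion estimate stays stable as the population grows, but the number of distinct values the sample can even observe becomes a vanishing fraction of the true ndistinct:

```python
import random

# Hypothetical illustration: fixed-size sample vs. growing table.
# Each value repeats ~100 times (an assumed data shape), so the true
# ndistinct scales with the table. The sample size does not.
random.seed(0)

SAMPLE_SIZE = 3000  # 300 * default_statistics_target (10)

for n_rows in (1_000_000, 100_000_000):
    n_distinct = n_rows // 100   # true number of distinct values
    sample = [random.randrange(n_distinct) for _ in range(SAMPLE_SIZE)]
    seen = len(set(sample))
    print(f"rows={n_rows:>11,}  true ndistinct={n_distinct:>9,}  "
          f"distinct values seen in sample={seen}")
```

On the 1-million-row table the sample still revisits values often enough to hint at the duplication structure; on the 100-million-row table nearly every sampled row is a value never seen before, so the sample carries almost no information to separate "1 million distinct" from "10 million distinct".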

Note that the estimator we use was shown to be stable for sample sizes in
the range of 5-20% of the table.
http://www.almaden.ibm.com/cs/people/peterh/jasa3rj.pdf
We currently use a sample size of 300 * stats_target rows. With the
default of 10, that means our sample is 0.3% of a 1 million row table,
and 0.003% of a 100 million row table (and such tables are common, I
find).
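For what it's worth, the arithmetic behind those percentages is easy to check (a throwaway sketch, not anything from the actual ANALYZE code):

```python
# Hypothetical check of the sample fractions quoted above.
stats_target = 10                 # 8.4-era default_statistics_target
sample_rows = 300 * stats_target  # 3,000 rows sampled

for table_rows in (1_000_000, 100_000_000):
    pct = 100.0 * sample_rows / table_rows
    print(f"{table_rows:>11,} rows -> sample is {pct:g}% of the table")
```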

That size of sample is OK for some kinds of statistics, but not much
good for ndistinct estimation.

These issues only show up in the field; they never show up on optimizer
test platforms, which are typically several orders of magnitude too
small. (Conversely, the stats system works very well indeed for smaller
tables, so I'm not trying to debunk everything.)

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
