Re: Query optimizer 8.0.1 (and 8.0)

From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: pgsql(at)mohawksoft(dot)com
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 20:34:53
Message-ID: 20050207203453.GA7517@wolff.to
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 07, 2005 at 13:28:04 -0500,
> >
> > For large populations the accuracy of estimates of statistics based on
> > random
> > samples from that population are not very sensitve to population size and
> > depends primarily on the sample size. So that you would not expect to need
> > to use larger sample sizes on larger data sets for data sets over some
> > minimum size.
>
> That assumes a fairly low standard deviation. If the standard deviation is
> low, then a minimal sample size works fine. If there was zero deviation in
> the data, then a sample of one works fine.

This doesn't assume a low standard deviation. That wouldn't make sense
anyway since the standard deviation depends on the units used for
the measurements.

> In the current implementation of analyze.c, the default is 100 samples. On
> a table of 10,000 rows, that is probably a good number characterize the
> data enough for the query optimizer (1% sample). For a table with 4.6
> million rows, that's less than 0.002%

The fraction of rows isn't relevant unless it is a large fraction of the
total, in which case you can use a smaller sample than you might otherwise.

> Think about an iregularly occuring event, unevenly distributed throughout
> the data set. A randomized sample strategy normalized across the whole
> data set with too few samples will mischaracterize the event or even miss
> it altogether.

What you are saying here is that if you want more accurate statistics, you
need to sample more rows. That is true. However, the size of the sample
is essentially only dependent on the accuracy you need and not the size
of the population, for large populations.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Kirkwood 2005-02-07 21:22:09 Re: Query optimizer 8.0.1 (and 8.0)
Previous Message Martin Pitt 2005-02-07 19:55:58 Re: libpq API incompatibility between 7.4 and 8.0