Re: Query optimizer 8.0.1 (and 8.0)

From: pgsql(at)mohawksoft(dot)com
To: "Bruno Wolff III" <bruno(at)wolff(dot)to>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Ron Mayer" <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 18:28:04
Message-ID: 16805.24.91.171.78.1107800884.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> On Mon, Feb 07, 2005 at 11:27:59 -0500,
> pgsql(at)mohawksoft(dot)com wrote:
>>
>> It is inarguable that increasing the sample size increases the accuracy
>> of a study, especially when diversity of the subject is unknown. It is
>> known that reducing a sample size increases probability of error in any
>> poll or study. The required sample size depends on the variance of the
>> whole. It is mathematically unsound to ASSUME any sample size is valid
>> without understanding the standard deviation of the set.
>
> For large populations the accuracy of estimates of statistics based on
> random samples from that population is not very sensitive to population
> size and depends primarily on the sample size. So you would not expect to
> need larger sample sizes on larger data sets for data sets over some
> minimum size.

That assumes a fairly low standard deviation. If the standard deviation is
low, then a minimal sample size works fine. If there were zero deviation in
the data, then a sample of one would work fine.

If the standard deviation is high, then you need more samples. If you have
a high standard deviation and a large data set, you need more samples than
you would need for a smaller data set.
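
To put rough numbers on the standard deviation point, here is a minimal
sketch using the usual textbook rule for estimating a mean to within a
margin E at a given confidence level, n >= (z * sigma / E)^2. The function
name, the 95% z value and the margin of 1.0 are assumptions chosen for
illustration; none of this is taken from analyze.c.

```python
import math


def required_sample_size(std_dev: float, margin: float, z: float = 1.96) -> int:
    """Smallest n such that z * std_dev / sqrt(n) <= margin.

    Textbook rule for estimating a mean from a simple random sample.
    z = 1.96 corresponds to roughly 95% confidence (an assumption here).
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    if std_dev < 0:
        raise ValueError("std_dev must be non-negative")
    # With zero deviation every row is identical, so one row is enough.
    return max(1, math.ceil((z * std_dev / margin) ** 2))


if __name__ == "__main__":
    for sd in (0.0, 1.0, 10.0, 100.0):
        print(f"std_dev={sd:>6}: need about {required_sample_size(sd, margin=1.0)} rows")
```

The required sample grows with the square of the standard deviation, so a
fixed sample size that is fine for tame data can be badly short for noisy
data.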

In the current implementation of analyze.c, the default is 100 samples. On
a table of 10,000 rows, that is probably a good number to characterize the
data well enough for the query optimizer (a 1% sample). For a table with
4.6 million rows, that's only about 0.002%.
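
The percentages are easy to check. A quick sketch, where the 100-row sample
and the two table sizes are the figures quoted above and the function name
is made up for illustration:

```python
def sample_fraction_percent(sample_rows: int, table_rows: int) -> float:
    """Return the sample size as a percentage of the table size."""
    if table_rows <= 0:
        raise ValueError("table_rows must be positive")
    return 100.0 * sample_rows / table_rows


if __name__ == "__main__":
    # Figures from the message: 100 sampled rows, tables of 10,000 and 4.6M rows.
    for table_rows in (10_000, 4_600_000):
        pct = sample_fraction_percent(100, table_rows)
        print(f"{table_rows:>9} rows: {pct:.5f}% sampled")
```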

Think about an irregularly occurring event, unevenly distributed throughout
the data set. A randomized sample strategy normalized across the whole
data set with too few samples will mischaracterize the event or even miss
it altogether.
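
A rough sketch of how likely "miss it altogether" is, assuming the sample
is a uniform random draw of rows without replacement (a simplification, not
a description of what analyze.c actually does). The function name and the
count of 1,000 event rows are hypothetical, picked only to show the effect.

```python
def probability_of_missing(table_rows: int, event_rows: int, sample_rows: int) -> float:
    """Chance that a uniform random sample contains none of the event rows.

    Hypergeometric case with zero hits:
        P = prod_{i=0}^{n-1} (N - k - i) / (N - i)
    where N = table_rows, k = event_rows, n = sample_rows.
    """
    if not (0 <= event_rows <= table_rows and 0 <= sample_rows <= table_rows):
        raise ValueError("need 0 <= event_rows, sample_rows <= table_rows")
    # More sampled rows than non-event rows: at least one event row is hit.
    if sample_rows > table_rows - event_rows:
        return 0.0
    p = 1.0
    for i in range(sample_rows):
        p *= (table_rows - event_rows - i) / (table_rows - i)
    return p


if __name__ == "__main__":
    # Hypothetical: the event shows up in 1,000 rows of each table.
    for table_rows in (10_000, 4_600_000):
        p = probability_of_missing(table_rows, event_rows=1_000, sample_rows=100)
        print(f"{table_rows:>9} rows: P(sample misses the event) = {p:.4f}")
```

With the same 100-row sample and the same number of event rows, the small
table almost never misses the event while the large table almost always
does.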
