Re: Query optimizer 8.0.1 (and 8.0)

From: Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To: pgsql(at)mohawksoft(dot)com
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 21:22:09
Message-ID: 4207DC01.9060102@coretech.co.nz
Lists: pgsql-hackers

Maybe I am missing something - ISTM that you can increase your
statistics target for those larger tables to obtain a larger (i.e.
better) sample.
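For example (a sketch only; the table name `people` and column `lastname` are hypothetical), the per-column target can be raised before re-running ANALYZE — in 8.0 the default_statistics_target is 10, and ANALYZE samples roughly 300 * target rows:

```sql
-- Raise the statistics target for one column, then re-analyze.
-- A larger target means a larger sample and more histogram buckets.
ALTER TABLE people ALTER COLUMN lastname SET STATISTICS 1000;
ANALYZE people;
```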

regards

Mark

pgsql(at)mohawksoft(dot)com wrote:
>>pgsql(at)mohawksoft(dot)com writes:
> Any and all random sampling assumes a degree of uniform distribution. This
> is the basis of the model: it assumes that chunks of the whole will be
> representative of the whole (to some degree). This works when normal
> variations are more or less distributed uniformly. As variations and
> trends become less uniformly distributed, more samples are required to
> characterize them.
>
> Douglas Adams had a great device called the "Total Perspective Vortex"
> which inferred the whole of the universe from a piece of fairy cake. It was
> a subtle play on the absurd notion that a very small sample could lead to
> an understanding of an infinitely larger whole.
>
> On a very basic level, why bother sampling the whole table at all? Why not
> check one block and infer all information from that? Because we know that
> isn't enough data. In a table of 4.6 million rows, can you say with any
> mathematical certainty that a sample of 100 points can be, in any way,
> representative?
>
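As a rough back-of-the-envelope check (the standard binomial error formula, not PostgreSQL's actual estimator), the one-sigma uncertainty of a proportion estimated from a simple random sample of n rows shrinks only as the square root of n:

```python
import math

def stderr(p, n):
    """Standard error of a proportion p estimated from n sampled rows."""
    return math.sqrt(p * (1 - p) / n)

# Worst case p = 0.5: with 100 sample points the one-sigma error on the
# estimated fraction is about 5 percentage points; with 10,000 points it
# is about 0.5 percentage points.
print(stderr(0.5, 100))    # ~0.05
print(stderr(0.5, 10000))  # ~0.005
```

Note that for an independent random sample this error depends on the sample size, not on the 4.6-million-row table size — the harder cases are exactly the clustered, non-uniform distributions described above.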
> Another problem with random sampling is trend analysis. Oftentimes there
> are minor trends in data. Ron pointed out the lastname/firstname trend.
> Although there seems to be no correlation between first names in the table,
> there are clearly groups or clusters of ordered data, an ordering that is
> missed by too small a sample.
>
> I understand why you chose the Vitter algorithm, because it provides a
> basically sound methodology for sampling without knowledge of the size of
> the whole, but I think we can do better. I would suggest using the current
> algorithm the first time through, then adjusting the number of samples [n]
> based on the previous estimate of the size of the table [N]. Each
> successive ANALYZE will become more accurate. The Vitter algorithm is
> still useful, as [N] will always be an estimate.
>
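For reference, the core idea of reservoir sampling — keeping a fixed-size uniform sample [n] from a stream whose length [N] is unknown in advance — can be sketched as follows. This is the simple Algorithm R, a minimal illustration, not PostgreSQL's actual implementation (which uses Vitter's more elaborate Algorithm Z together with block-level sampling):

```python
import random

def reservoir_sample(rows, n):
    """Keep a uniform random sample of n items from a stream of
    unknown length (reservoir sampling, Algorithm R)."""
    sample = []
    for i, row in enumerate(rows):
        if i < n:
            # Fill the reservoir with the first n items.
            sample.append(row)
        else:
            # Replace a reservoir entry with probability n / (i + 1),
            # which keeps every item equally likely to survive.
            j = random.randint(0, i)
            if j < n:
                sample[j] = row
    return sample

random.seed(0)
print(reservoir_sample(range(100_000), 10))
```

Because the reservoir size n is fixed before the scan starts, adjusting n between ANALYZE runs (based on the previous estimate of N, as suggested above) fits naturally into this scheme.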
