Re: Query optimizer 8.0.1 (and 8.0)

From: pgsql(at)mohawksoft(dot)com
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Ron Mayer" <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 14:30:45
Message-ID: 16881.24.91.171.78.1107786645.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> pgsql(at)mohawksoft(dot)com writes:
>> One of the things that is disturbing to me about the analyze settings is
>> that it wants to sample the same number of records from a table
>> regardless
>> of the size of that table.
>
> The papers that I looked at say that this rule has a good solid
> statistical foundation, at least for the case of estimating histograms.
> See the references cited in analyze.c.

Any random sampling assumes a degree of uniform distribution; that is the
basis of the model. It assumes that chunks of the whole will be
representative of the whole (to some degree). That works when normal
variations are distributed more or less uniformly. As variations and
trends become less uniformly distributed, more samples are required to
characterize them.

Douglas Adams had a great device called the "Total Perspective Vortex"
which inferred the whole of the universe from a piece of fairy cake. It was
a subtle play on the absurd notion that a very small sample could lead to
an understanding of an infinitely larger whole.

On a very basic level, why bother sampling the whole table at all? Why not
check one block and infer all information from that? Because we know that
isn't enough data. In a table of 4.6 million rows, can you say with any
mathematical certainty that a sample of 100 points is in any way
representative?
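
To put a rough number on it (my own back-of-the-envelope figures, not
anything out of analyze.c): if 100 rows are drawn uniformly from the
table, the chance that every one of them misses a contiguous run holding a
fraction f of the rows is about (1 - f)^100. A trivial C sketch:

#include <stdio.h>
#include <math.h>

/*
 * Back-of-the-envelope: probability that a uniform sample of n rows
 * completely misses a contiguous run holding fraction f of the table.
 * Treats the draws as independent; with n = 100 out of 4.6 million
 * rows, sampling with or without replacement is essentially the same.
 */
int main(void)
{
    const int    n = 100;       /* sample size */
    const double fractions[] = {0.001, 0.01, 0.05, 0.10};

    for (int i = 0; i < 4; i++)
    {
        double  f = fractions[i];
        double  p_miss = pow(1.0 - f, n);

        printf("run covering %5.1f%% of table: missed entirely %4.1f%% of the time\n",
               f * 100.0, p_miss * 100.0);
    }
    return 0;
}

A run covering a full 1% of the 4.6-million-row table (about 46,000 rows)
is missed entirely more than a third of the time.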

Another problem with random sampling is trend analysis. Oftentimes there
are minor trends in the data. Ron pointed out the lastname/firstname trend:
although there appears to be no correlation between first names across the
table as a whole, there are clearly groups or clusters of locally ordered
data, and that ordering is missed by too small a sample.
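
To make that concrete, here is a hedged little simulation of my own
(nothing from analyze.c; the numbers are picked purely for illustration):
a 4.6-million-row table whose values are laid out in contiguous runs of
100 equal values, sampled 100 rows at a time. If the sample rarely even
contains two rows from the same run, it cannot possibly see the
clustering.

#include <stdio.h>
#include <stdlib.h>

#define NROWS   4600000     /* table size from the example above */
#define RUNLEN  100         /* rows per contiguous run of equal values */
#define NSAMPLE 100         /* rows the sample gets to look at */
#define TRIALS  1000

/* uniform row number in [0, NROWS) */
static long
random_row(void)
{
    return (long) (drand48() * NROWS);
}

int main(void)
{
    int     trials_with_repeat = 0;

    srand48(12345);

    for (int t = 0; t < TRIALS; t++)
    {
        long    run[NSAMPLE];
        int     saw_repeat = 0;

        for (int i = 0; i < NSAMPLE; i++)
        {
            run[i] = random_row() / RUNLEN;   /* which run this row falls in */
            for (int j = 0; j < i; j++)
                if (run[j] == run[i])
                    saw_repeat = 1;
        }
        trials_with_repeat += saw_repeat;
    }

    printf("%d of %d samples contained even two rows from the same run\n",
           trials_with_repeat, TRIALS);
    return 0;
}

With 46,000 distinct runs, roughly nine samples out of ten contain no two
rows from the same run at all, so whatever local ordering exists inside
those runs is simply invisible to the statistics.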

I understand why you chose the Vitter algorithm: it provides a basically
sound methodology for sampling without knowledge of the size of the whole.
But I think we can do better. I would suggest using the current algorithm
the first time through, then adjusting the number of samples [n] based on
the previous estimate of the size of the table [N]. Each successive
ANALYZE will become more accurate. The Vitter algorithm is still useful,
since [N] will always be an estimate.
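
To sketch what I mean (the function name, the 1% fraction, and the cap
below are mine, purely for illustration, not a concrete patch): keep the
existing fixed target the first time through, and afterwards derive the
target row count from the [N] estimate left behind by the previous
ANALYZE.

#include <stdio.h>

/*
 * Hypothetical sketch: choose the ANALYZE sample size [n] from the
 * previous estimate of the table size [N].  The fraction and the cap
 * are arbitrary knobs for illustration only.
 */
#define BASE_TARGET_ROWS    3000        /* stand-in for the current fixed target */
#define SAMPLE_FRACTION     0.01        /* fraction of the estimated table */
#define MAX_TARGET_ROWS     500000      /* keep ANALYZE from reading everything */

static int
choose_sample_size(double prev_reltuples)
{
    double  n;

    if (prev_reltuples <= 0)
        return BASE_TARGET_ROWS;        /* first ANALYZE: no estimate of N yet */

    n = prev_reltuples * SAMPLE_FRACTION;
    if (n < BASE_TARGET_ROWS)
        n = BASE_TARGET_ROWS;
    if (n > MAX_TARGET_ROWS)
        n = MAX_TARGET_ROWS;
    return (int) n;
}

int main(void)
{
    double  sizes[] = {0, 50000, 4600000, 200000000};

    for (int i = 0; i < 4; i++)
        printf("estimated N = %12.0f  ->  sample n = %d rows\n",
               sizes[i], choose_sample_size(sizes[i]));
    return 0;
}

The Vitter pass would still do the actual row selection; only the target
row count changes, and each successive ANALYZE starts from a better
estimate of [N].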
