Re: Query optimizer 8.0.1 (and 8.0)

From: pgsql(at)mohawksoft(dot)com
To: "Greg Stark" <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-08 14:33:26
Message-ID: 16625.24.91.171.78.1107873206.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> pgsql(at)mohawksoft(dot)com writes:
>
>> The basic problem with a fixed sample is that is assumes a normal
>> distribution.
>
> That's sort of true, but not in the way you think it is.
>
[snip]

Greg, I think you have an excellent ability to articulate stats, but I
think that the view that this is like election polling is incorrect.

Election polling assumes a very simple outcome: Some standard ditribution
of a limited number options. I don't think it applies to this.

>
>> When you look at a sine wave on an oscilloscope, you can see it clear as
>> day. When you look at music on the scope, you know there are many waves
>> there, but it is difficult to make heads or tails of it. (use xmms or
>> winamp to see for yourself) The waves change in frequency, amplitude,
>> and
>> duration over a very large scale. That's why you use a spectrum analyzer
>> to go from time domain to frequency domain. In frequency domain, you can
>> see the trends better.
>
> That's not a bad analogy to many problems where you're measuring data that
> has
> non-randomness in it but that are not visible in the domain that the
> statistics that are being analyzed. This seems to happen a lot with
> geographic
> data, for instance.

EXACTLY!!!

>
> If you find that increasing the stats targets improves things then this
> isn't true. If you find that it doesn't then what's really needed is a
> cleverer set of statistics to look for.

I will be the first one to say that increasing the samples is not perfect,
but it is a methodology that will help without major changes in postgres.
Simply increasing the samples to a percentage of the estimated number of
rows (with some upper and lower limits of course) will increase the
accuracy of the "n_distinct" and "correlation" settings (at least a little
bit), and that will make a huge impact with very little work.

If we want to discuss improved statatistics, then we should include a
standard deviation and a sliding window deviation, or something like that.
Hell, maybe even FFT.

The basic problem, I think, is that the sampling mechanism is more like an
oscilloscope looking for large trends instead of a spectrum analyzer
looking for the smaller ones.

We have to be able to tell the planner that adjacent values are less
random even though, as a whole, they are seemingly random.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2005-02-08 15:25:26 Re: float4 regression test failed on linux parisc
Previous Message Jim Buttafuoco 2005-02-08 14:14:35 Fw: Re: float4 regression test failed on linux parisc