Re: Better default_statistics_target

From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-patches(at)postgresql(dot)org
Subject: Re: Better default_statistics_target
Date: 2007-12-05 20:13:48
Message-ID: 608x48yk5f.fsf@dba2.int.libertyrms.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-patches

guillaume(dot)smet(at)gmail(dot)com ("Guillaume Smet") writes:
> On Dec 5, 2007 3:26 PM, Greg Sabino Mullane <greg(at)turnstep(dot)com> wrote:
>> Agreed, this would be a nice 8.4 thing. But what about 8.3 and 8.2? Is
>> there a reason not to make this change? I know I've been lazy and not run
>> any absolute figures, but rough tests show that raising it (from 10 to
>> 100) results in a very minor increase in analyze time, even for large
>> databases. I think the burden of a slightly slower analyze time, which
>> can be easily adjusted, both in postgresql.conf and right before running
>> an analyze, is very small compared to the pain of some queries - which worked
>> before - suddenly running much, much slower for no apparent reason at all.
>
> As Tom stated it earlier, the ANALYZE slow down is far from being the
> only consequence. The planner will also have more work to do and
> that's the hard point IMHO.
>
> Without studying the impacts of this change on a large set of queries
> in different cases, it's quite hard to know for sure that it won't
> have a negative impact in a lot of cases.
>
> It's a bit too late in the cycle to change that IMHO, especially
> without any numbers.

I have the theory (thus far not borne out by any numbers) that it
might be a useful approach to try to go through the DB schema and use
what information is there to try to come up with better numbers on a
per-column basis.

As a "first order" perspective on things:

- Any columns marked "unique" could keep to having somewhat smaller
numbers of bins in the histogram because we know that uniqueness
will keep values dispersed at least somewhat.

Ditto for "SERIAL" types.

- Columns NOT marked unique should imply adding some bins to the
histogram.

- Datestamps tend to imply temporal dispersion, ergo "somewhat fewer
bins." Similar for floats.

- Discrete values (integer, text) frequently see less dispersion,
-> "more bins"

Then could come a "second order" perspective, where data would
actually get sampled from pg_statistics.

- If we look at the number of distinct histogram bins used, for a
particular column, and find that there are some not used, we might
drop bins.

- We might try doing some summary statistics to see how many unique
values there actually are, on each column, and increase the number of
bins if they're all in use, and there are other values that *are*
frequently used.

Maybe cheaper, if we find that pg_statistics tells us that all bins
are in use, and extrapolation shows that there's a lot of the table
NOT represented, we increase the number of bins.

There might even be a "third order" analysis, where you'd try to
collect additional data from the table, and analytically try to
determine appropriate numbers of bins...

Thus, we don't have a universal increase in the amount of statistics
collected - the added stats are localized to places where there is
some reason to imagine them useful.
--
let name="cbbrowne" and tld="acm.org" in String.concat "@" [name;tld];;
http://cbbrowne.com/info/nonrdbms.html
There was a young lady of Crewe
Whose limericks stopped at line two.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2007-12-05 20:27:49 Re: buildenv.pl/buildenv.bat
Previous Message Alvaro Herrera 2007-12-05 20:11:38 Re: Open items for 8.3

Browse pgsql-patches by date

  From Date Subject
Next Message Gregory Stark 2007-12-06 00:00:30 Re: Better default_statistics_target
Previous Message Andrew Chernow 2007-12-05 20:03:02 Re: PQParam version 0.5