Re: When to bump up statistics?

From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: When to bump up statistics?
Date: 2004-11-19 20:23:15
Message-ID: 608y8xfv0s.fsf@dba2.int.libertyrms.com
Lists: pgsql-performance

qnex42(at)gmail(dot)com (Dawid Kuroczko) writes:
> ALTER TABLE foo ALTER COLUMN bar SET STATISTICS n; .....
>
> I wonder what are the implications of using this statement,
> I know by using, say n=100, ANALYZE will take more time,
> pg_statistics will be bigger, planner will take longer time,
> on the other hand it will make better decisions... Etc, etc.
>
> I wonder, however, when it is most useful to bump it up.
> Please tell me what you think about it:
>
> Is bumping up statistics only useful for indexed columns?

That is where the main decision changes resulting from this would
occur: better statistics chiefly sharpen the planner's choice between
an index scan and a sequential scan, and that choice only arises for
indexed columns.
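
For instance (table, column, and value all invented here), with
decent stats the planner can tell a rare value from a common one:

  EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
  -- if the stats say 42 is rare, expect an index scan;
  -- if they say it is common, a seq scan is cheaper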

> When is it most useful/beneficial to bump them up:
>
> 1) huge table with huge number of distinct values (_almost_
> unique ;))
>
> 2) huge table with relatively equally distributed values
> (like each value is in between, say, 30-50 rows).
>
> 3) huge table with unequally distributed values (some
> values are in 1-5 rows, some are in 1000-5000 rows).
>
> 4) huge table with a small number of values (around 100
> distinct values, equally or unequally distributed).

No hard and fast rule has emerged, and certainly not one that
distinguishes precisely between these cases.

There are two effects that come out of changing the numbers:

1. They increase the number of tuples examined.

This would pointedly affect cases 3 and 4, increasing the
likelihood that the statistics are more representative.

2. They increase the number of samples that are kept, increasing the
number of items recorded in the histogram.

If you have on the order of 100 unique values (it would not be
unusual for a company to have 100 "main" customers or suppliers),
that allows nearly a bin apiece, which makes estimates _way_ more
representative for both the common and the less common values
among the "top 100."

Both of those properties are useful for pretty much all of the above
cases.
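
To make that concrete, the usual workflow looks something like this
(the names are made up; pg_stats is the standard system view):

  ALTER TABLE foo ALTER COLUMN bar SET STATISTICS 100;
  ANALYZE foo;

  -- see what ANALYZE actually collected:
  SELECT n_distinct, most_common_vals, histogram_bounds
    FROM pg_stats
   WHERE tablename = 'foo' AND attname = 'bar';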

> 5) boolean column.

A boolean column would more or less indicate SET STATISTICS 2; the
only point in having more would be if one of the values almost never
occurred, so that you'd need to collect more stats just to pick up
instances of the "rare" case.
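
Something like this (the column name is invented):

  ALTER TABLE foo ALTER COLUMN is_active SET STATISTICS 2;
  ANALYZE foo;
  -- two most-common-value slots suffice to record the
  -- frequencies of both true and false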

A boolean column is seldom much use for indices anyways...

> I think SET STATISTICS 100 is very useful for case with unequally
> distributed values, but I wonder what about the other cases. And as
> a side note -- what are the reasonable bounds for statistics
> (between 10 and 100?)

If there are, say, 200 unique values, then increasing from 10 to 100
would seem likely to be useful in making the histogram MUCH more
representative...
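
If you can afford the scan, a quick sanity check before picking a
target:

  SELECT count(DISTINCT bar) FROM foo;
  -- if this comes back near 200, SET STATISTICS 100 gives the
  -- MCV list and histogram a fighting chance of covering most values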

> What are the runtime implications of setting statistics too large --
> how much can it affect queries?

More stats would mean a bit more time evaluating query plans, but the
quality of the plans should be better.
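
One rough way to watch that overhead, since EXPLAIN plans a query
without executing it:

  \timing
  EXPLAIN SELECT * FROM foo WHERE bar = 42;
  -- compare the elapsed times before and after raising the target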

> And finally -- how other RDBMS and RDBM-likes deal with this issue?
> :)

For Oracle and DB2, the issues are not dissimilar. Oracle somewhat
prefers the notion of collecting comprehensive statistics on the
whole table, which is even more costly than PostgreSQL's sampling.
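
Roughly, in Oracle's old-style syntax (as an illustration only):

  ANALYZE TABLE foo COMPUTE STATISTICS;   -- reads the whole table
  ANALYZE TABLE foo ESTIMATE STATISTICS SAMPLE 10 PERCENT;
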
--
let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];;
http://www.ntlug.org/~cbbrowne/linuxxian.html
A VAX is virtually a computer, but not quite.
