Re: On Distributions In 7.2 (Longish)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark kirkwood <markir(at)slingshot(dot)co(dot)nz>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: On Distributions In 7.2 (Longish)
Date: 2001-10-27 18:50:03
Message-ID: 3445.1004208603@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Mark kirkwood <markir(at)slingshot(dot)co(dot)nz> writes:
> I have been examining the behaviour of these additions, using the dataset
> from my "warehouse comparison" as a test bed - as it has some large-ish
> tables with controllable data distributions.

Many thanks for this! I haven't yet gotten much feedback on real-world
performance of the new statistics code.

> --ALTER TABLE fact0 ALTER d0key SET STATISTICS 10;

> most_common_vals
> {"2243","2751","105","250","525","1623","2112","2331","2983","28"}
> Most_common_freqs
> {"0.00233333","0.002","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00133333"}
> n_distinct 36511

> Note we are out by an order of magnitude here for number distinct -
> should be 3000, and the frequencies are a little overestimated -
> should be 0.001

The factor-of-2 error for the "most common" frequency doesn't bother me;
that's to be expected, considering that we're using a random sampling
of the table rows. However, the factor-of-10 error in the n_distinct
estimate is more troubling. Could you trace through the code that
produces that estimate (in current sources, near line 1310 of
src/backend/commands/analyze.c) and see if you can tell why it's so
far off?

Another interesting thing to look at is how much the stats change in
repeated runs of ANALYZE. Since we're taking a random sample,
successive runs can be expected to produce slightly different answers.
I would like to think that the answers won't change too far ... but
maybe this n_distict estimate was an outlier.

> ALTER TABLE fact0 ALTER d0key SET STATISTICS 100;

> most_common_vals {"328","1242","524","1515","2058","2168",( 94 more
> values)...
> most_common_freqs
> {"0.0007","0.0007","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000633333",....
> n_distinct 3027

> Now the number of distinct values is very accurate and frequencies are a
> little underestimated.

This is a little strange as well. If the true frequency of every d0key
value is 0.001, then the random sample should produce a few estimates
larger than that as well as a few smaller; so I'd expect the "most
common" entries to be larger than 0.001. Furthermore, there is code in
there that's supposed to suppress "most common" entries that aren't
really much more common than the estimated average (cf lines 1360-1380).
Why didn't that trigger?

Looks like I may have some bugs to fix here.

> The conclusion here seems to be that the 10 quantiles are not quite enough
> for accurate distributional data where (large) tables have a few thousand
> distinct values. However 100 quantiles was sufficient to get accurate
> statistics.
> Experimentation showed that accurate estimates (+/- 10%) of number of
> distinct values did not begin to appear until about 75 quantiles were used.

One of the things I would like to establish before we finish beta is
whether the default stats target of 10 is large enough. I am not very
comfortable with raising it as far as 75-100, but would not be fazed
with numbers around 20-30. I appreciate your feedback on this point.

I wonder, though, whether what we're looking at isn't just a problem
with the number-of-distinct-values estimation equation. The one that's
in there seemed a little phony to me, but I couldn't find anything else
in a cursory literature search.

regards, tom lane

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Keary Suska 2001-10-27 19:00:37 Re: performance tuning
Previous Message Sheer El-Showk 2001-10-27 17:58:42 performance tuning