Re: On Distributions In 7.2 (Longish)

From: Mark kirkwood <markir(at)slingshot(dot)co(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: On Distributions In 7.2 (Longish)
Date: 2001-10-27 23:50:47
Message-ID: 01102812504700.01068@spikey.slithery.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Tom

1) I will look at analyze.c and
2) do more analyze runs with 10 buckets to see if I am getting some bad
probabilistic effects.

With respect to 2), I will also try the same examination on a Freebsd 4.4 box
that I have got here ( just to check that the Linux Mandrake 8.0 machine I
use is not the source of any randomness problems)

regards

Mark

>On Sunday 28 October 2001 07:50, Tom Lane wrote:
> Mark kirkwood <markir(at)slingshot(dot)co(dot)nz> writes:
> > I have been examining the behaviour of these additions, using the dataset
> > from my "warehouse comparison" as a test bed - as it has some large-ish
> > tables with controllable data distributions.
>
> Many thanks for this! I haven't yet gotten much feedback on real-world
> performance of the new statistics code.
>
> > --ALTER TABLE fact0 ALTER d0key SET STATISTICS 10;
> >
> > most_common_vals
> > {"2243","2751","105","250","525","1623","2112","2331","2983","28"}
> > Most_common_freqs
> > {"0.00233333","0.002","0.00166667","0.00166667","0.00166667","0.00166667"
> >,"0.00166667","0.00166667","0.00166667","0.00133333"} n_distinct
> > 36511
> >
> > Note we are out by an order of magnitude here for number distinct -
> > should be 3000, and the frequencies are a little overestimated -
> > should be 0.001
>
> The factor-of-2 error for the "most common" frequency doesn't bother me;
> that's to be expected, considering that we're using a random sampling
> of the table rows. However, the factor-of-10 error in the n_distinct
> estimate is more troubling. Could you trace through the code that
> produces that estimate (in current sources, near line 1310 of
> src/backend/commands/analyze.c) and see if you can tell why it's so
> far off?
>
> Another interesting thing to look at is how much the stats change in
> repeated runs of ANALYZE. Since we're taking a random sample,
> successive runs can be expected to produce slightly different answers.
> I would like to think that the answers won't change too far ... but
> maybe this n_distict estimate was an outlier.
>
> > ALTER TABLE fact0 ALTER d0key SET STATISTICS 100;
> >
> > most_common_vals {"328","1242","524","1515","2058","2168",( 94 more
> > values)...
> > most_common_freqs
> > {"0.0007","0.0007","0.000666667","0.000666667","0.000666667","0.000666667
> >","0.000666667","0.000666667","0.000633333",.... n_distinct 3027
> >
> > Now the number of distinct values is very accurate and frequencies are a
> > little underestimated.
>
> This is a little strange as well. If the true frequency of every d0key
> value is 0.001, then the random sample should produce a few estimates
> larger than that as well as a few smaller; so I'd expect the "most
> common" entries to be larger than 0.001. Furthermore, there is code in
> there that's supposed to suppress "most common" entries that aren't
> really much more common than the estimated average (cf lines 1360-1380).
> Why didn't that trigger?
>
> Looks like I may have some bugs to fix here.
>
> > The conclusion here seems to be that the 10 quantiles are not quite
> > enough for accurate distributional data where (large) tables have a few
> > thousand distinct values. However 100 quantiles was sufficient to get
> > accurate statistics.
> > Experimentation showed that accurate estimates (+/- 10%) of number of
> > distinct values did not begin to appear until about 75 quantiles were
> > used.
>
> One of the things I would like to establish before we finish beta is
> whether the default stats target of 10 is large enough. I am not very
> comfortable with raising it as far as 75-100, but would not be fazed
> with numbers around 20-30. I appreciate your feedback on this point.
>
> I wonder, though, whether what we're looking at isn't just a problem
> with the number-of-distinct-values estimation equation. The one that's
> in there seemed a little phony to me, but I couldn't find anything else
> in a cursory literature search.
>
> regards, tom lane

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Carlos Augusto 2001-10-28 03:21:25 IDE
Previous Message Sheer El-Showk 2001-10-27 22:57:19 Re: performance tuning