Re: On Distributions In 7.2

From: Mark kirkwood <markir(at)slingshot(dot)co(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: On Distributions In 7.2
Date: 2001-10-29 08:53:56
Message-ID: 01102921535600.04563@spikey.slithery.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Tom Lane wrote (edited....)
> The factor-of-2 error for the "most common" frequency doesn't bother me;
> that's to be expected, considering that we're using a random sampling
> of the table rows. However, the factor-of-10 error in the n_distinct
> estimate is more troubling. Could you trace through the code that
> produces that estimate (in current sources, near line 1310 of
> src/backend/commands/analyze.c) and see if you can tell why it's so
> far off?

My limited tracing of analyze.c
( adding various elog calls in lines 1288->1344 of Beta 1 sources ) :

ANALYZE fact0;

DEBUG: Analyzing fact0
DEBUG: Analyze : beginning a column
DEBUG: Have 3000 values in relation (?) (numrows)
DEBUG: Have 3000036 total values in relation (totalrows)
DEBUG: Have 3000 values in sample (values_cnt)
DEBUG: Have 1875 distinct values in sample (ndistinct)
DEBUG: Have 813 multiple values in sample (nmultiple)
DEBUG: calc 1116.000000 f1 for Chaudhuri rule
DEBUG: calc 35291.230469 term1 for Chaudhuri rule
DEBUG: calc 34397.000000 distinct via Chaudhuri rule

That term1 seems like the problem. I am thinking of a sort of modifier
function that tames the extreme values for term1 when sample size << rows in
relation, but tends to 1 as sample size ~ rows in relation.

a quick and dirty experiment with a vaguely suitable such function in
analyze.c ( approx line 1333):

/* term1 = sqrt(totalrows / (double) numrows) * f1; */
term1 = sqrt(totalrows / (double) numrows) * f1 /
log10(log10(1000) * totalrows/numrows);

ANALYZE fact0;

DEBUG: Analyzing fact0
DEBUG: Analyze : beginning a column
DEBUG: Have 3000 values in relation
DEBUG: Have 3000036 total values in relation
DEBUG: Have 3000 values in sample
DEBUG: Have 1898 distinct values in sample
DEBUG: Have 798 multiple values in sample
DEBUG: calc 1100.000000 f1 for Modified Chaudhuri rule
DEBUG: calc 10004.025391 term1 for Modified Chaudhuri rule
DEBUG: calc 10802.000000 distinct via Modified Chaudhuri rule

which is better (only factor of 3 out now !)

now see if I have killed the larger sample behaviour :

ALTER TABLE fact0 ALTER d0key SET STATISTICS 100;
ANALYZE fact0;

DEBUG: Analyzing fact0
DEBUG: Analyze : beginning a column
DEBUG: Have 30000 values in relation
DEBUG: Have 3000036 total values in relation
DEBUG: Have 30000 values in sample
DEBUG: Have 3000 distinct values in sample
DEBUG: Have 2998 multiple values in sample
DEBUG: calc 2.000000 f1 for Modified Chaudhuri rule
DEBUG: calc 8.073919 term1 for Modified Chaudhuri rule
DEBUG: calc 3006.000000 distinct via Modified Chaudhuri rule

So is kind of doing the right thing, however a modifier function that
_actually_ tends to 1 as numrows -> totalrows (and is more accurate at small
sample size) would be much better ! ( where are those 2nd year math books
when you want them....)

>
> Another interesting thing to look at is how much the stats change in
> repeated runs of ANALYZE. Since we're taking a random sample,
> successive runs can be expected to produce slightly different answers.
> I would like to think that the answers won't change too far ... but
> maybe this n_distict estimate was an outlier.

Repeated tests on the Freebsd 4.4 machine obtained values consistently in the
34000-36000 range, so the sampling method looks great.
>
>
> One of the things I would like to establish before we finish beta is
> whether the default stats target of 10 is large enough. I am not very
> comfortable with raising it as far as 75-100, but would not be fazed
> with numbers around 20-30. I appreciate your feedback on this point.
>
Agreed, 100 seems too big, 10->30 *feels* much better.

regards

Mark

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Mark kirkwood 2001-10-29 09:18:38 More On 7.2 Distributions - Estimates For Number Distinct < 0
Previous Message Jean-Michel POURE 2001-10-29 08:10:04 Re: [HACKERS] Ultimate DB Server