Re: Odd statistics behaviour in 7.2

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Gordon A(dot) Runkle" <gar(at)integrated-dynamics(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Odd statistics behaviour in 7.2
Date: 2002-02-16 00:09:42
Message-ID: 16868.1013818182@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

"Gordon A. Runkle" <gar(at)integrated-dynamics(dot)com> writes:
> [ tale of poor statistical results in a near-unique column ]

After some further digging in the literature I have come across a
number-of-distinct-values estimator that I like better than Chaudhuri's.
It runs like this:

n*d / (n - f1 + f1*n/N)
where f1 is the number of distinct values that occurred
exactly once in our sample of n rows (from a total of N),
and d is the total number of distinct values in the sample.

The nice properties this has include:

1. At f1=0 (all values in sample occurred more than once), the estimate
reduces to d, which I had already determined to be a sensible result.

2. At f1=n (all values were distinct, hence also d=n), the estimate
reduces to N, which again is the most reasonable estimate.

3. The equation is numerically stable even when n is much smaller than
N, because the only cancellation is in the term (n - f1) which we can
compute exactly. A lot of the other equations I looked at depend on
series like (1 - n/N)**i which are going to be really nasty when n/N
is tiny.

In particular, point 2 means that this equation should perform
reasonably well for nearly-unique columns (f1 approaching n),
which was the case you were seeing bad results for.

Attached is a patch that implements this revised equation. Would
appreciate any feedback...

regards, tom lane

*** src/backend/commands/analyze.c.orig Sat Jan 5 19:37:44 2002
--- src/backend/commands/analyze.c Fri Feb 15 18:46:45 2002
***************
*** 1009,1018 ****
{
/*----------
* Estimate the number of distinct values using the estimator
! * proposed by Chaudhuri et al (see citation above). This is
! * sqrt(n/r) * max(f1,1) + f2 + f3 + ...
! * where fk is the number of distinct values that occurred
! * exactly k times in our sample of r rows (from a total of n).
* We assume (not very reliably!) that all the multiply-occurring
* values are reflected in the final track[] list, and the other
* nonnull values all appeared but once. (XXX this usually
--- 1009,1023 ----
{
/*----------
* Estimate the number of distinct values using the estimator
! * proposed by Haas and Stokes in IBM Research Report RJ 10025:
! * n*d / (n - f1 + f1*n/N)
! * where f1 is the number of distinct values that occurred
! * exactly once in our sample of n rows (from a total of N),
! * and d is the total number of distinct values in the sample.
! * This is their Duj1 estimator; the other estimators they
! * recommend are considerably more complex, and are numerically
! * very unstable when n is much smaller than N.
! *
* We assume (not very reliably!) that all the multiply-occurring
* values are reflected in the final track[] list, and the other
* nonnull values all appeared but once. (XXX this usually
***************
*** 1021,1032 ****
*----------
*/
int f1 = nonnull_cnt - summultiple;
! double term1;

! if (f1 < 1)
! f1 = 1;
! term1 = sqrt(totalrows / (double) numrows) * f1;
! stats->stadistinct = floor(term1 + nmultiple + 0.5);
}

/*
--- 1026,1044 ----
*----------
*/
int f1 = nonnull_cnt - summultiple;
! int d = f1 + nmultiple;
! double numer, denom, stadistinct;

! numer = (double) numrows * (double) d;
! denom = (double) (numrows - f1) +
! (double) f1 * (double) numrows / totalrows;
! stadistinct = numer / denom;
! /* Clamp to sane range in case of roundoff error */
! if (stadistinct < (double) d)
! stadistinct = (double) d;
! if (stadistinct > totalrows)
! stadistinct = totalrows;
! stats->stadistinct = floor(stadistinct + 0.5);
}

/*
***************
*** 1313,1332 ****
{
/*----------
* Estimate the number of distinct values using the estimator
! * proposed by Chaudhuri et al (see citation above). This is
! * sqrt(n/r) * max(f1,1) + f2 + f3 + ...
! * where fk is the number of distinct values that occurred
! * exactly k times in our sample of r rows (from a total of n).
* Overwidth values are assumed to have been distinct.
*----------
*/
int f1 = ndistinct - nmultiple + toowide_cnt;
! double term1;

! if (f1 < 1)
! f1 = 1;
! term1 = sqrt(totalrows / (double) numrows) * f1;
! stats->stadistinct = floor(term1 + nmultiple + 0.5);
}

/*
--- 1325,1356 ----
{
/*----------
* Estimate the number of distinct values using the estimator
! * proposed by Haas and Stokes in IBM Research Report RJ 10025:
! * n*d / (n - f1 + f1*n/N)
! * where f1 is the number of distinct values that occurred
! * exactly once in our sample of n rows (from a total of N),
! * and d is the total number of distinct values in the sample.
! * This is their Duj1 estimator; the other estimators they
! * recommend are considerably more complex, and are numerically
! * very unstable when n is much smaller than N.
! *
* Overwidth values are assumed to have been distinct.
*----------
*/
int f1 = ndistinct - nmultiple + toowide_cnt;
! int d = f1 + nmultiple;
! double numer, denom, stadistinct;

! numer = (double) numrows * (double) d;
! denom = (double) (numrows - f1) +
! (double) f1 * (double) numrows / totalrows;
! stadistinct = numer / denom;
! /* Clamp to sane range in case of roundoff error */
! if (stadistinct < (double) d)
! stadistinct = (double) d;
! if (stadistinct > totalrows)
! stadistinct = totalrows;
! stats->stadistinct = floor(stadistinct + 0.5);
}

/*

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Marc G. Fournier 2002-02-16 00:14:40 Re: Ready to branch 7.2/7.3 ?
Previous Message Peter Eisentraut 2002-02-15 23:15:15 Parser conflicts in extended GRANT statement