Re: n_distinct way off, but following a pattern.

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: nickf(at)ontko(dot)com
Cc: "Pgsql-Performance(at)Postgresql(dot) Org" <pgsql-performance(at)postgresql(dot)org>, "Ray Ontko" <rayo(at)ontko(dot)com>
Subject: Re: n_distinct way off, but following a pattern.
Date: 2003-11-14 22:12:27
Message-ID: 5809.1068847947@sss.pgh.pa.us
Lists: pgsql-performance

"Nick Fankhauser" <nickf(at)ontko(dot)com> writes:
> I'm seeing estimates for n_distinct that are way off for a large table

Estimating n_distinct from a small sample is inherently a hard problem.
I'm not surprised that the estimates would get better as the sample size
increases. But maybe we can do better. The method we are currently
using is this:

/*----------
 * Estimate the number of distinct values using the estimator
 * proposed by Haas and Stokes in IBM Research Report RJ 10025:
 *      n*d / (n - f1 + f1*n/N)
 * where f1 is the number of distinct values that occurred
 * exactly once in our sample of n rows (from a total of N),
 * and d is the total number of distinct values in the sample.
 * This is their Duj1 estimator; the other estimators they
 * recommend are considerably more complex, and are numerically
 * very unstable when n is much smaller than N.
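
For concreteness, here is a minimal standalone sketch of that
computation; the function name and the example numbers are
illustrative, not taken from analyze.c:

    #include <stdio.h>

    /* Duj1 estimator from Haas & Stokes: n*d / (n - f1 + f1*n/N) */
    static double
    duj1_estimate(double n, double N, double d, double f1)
    {
        return (n * d) / ((n - f1) + f1 * n / N);
    }

    int
    main(void)
    {
        /* e.g. a 3000-row sample of a 1,000,000-row table in which
         * 2000 distinct values appeared, 1500 of them exactly once */
        printf("estimated n_distinct: %.0f\n",
               duj1_estimate(3000.0, 1000000.0, 2000.0, 1500.0));
        return 0;
    }

This prints an estimate of about 3988. Note that when N is much
larger than n, the f1*n/N term is tiny, so the result is driven
almost entirely by n*d/(n - f1): unless nearly every sampled value
is unique, the estimate stays within a small multiple of d, which
is one way a small sample can badly underestimate n_distinct.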

It would be interesting to see exactly what inputs are going into this
equation. Do you feel like adding some debug printouts into this code?
Or just looking at the variables with a debugger? In 7.3 it's about
line 1060 in src/backend/commands/analyze.c.
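
For example, a printout along these lines, added just before the
estimate is computed, would show the inputs. (Hypothetical: I'm
assuming the surrounding code exposes the sample size, the total
row estimate, and the f1/d counts under these names; adjust to
whatever the 7.3 source actually uses.)

    /* hypothetical debug aid; variable names/types assumed, not verified */
    elog(DEBUG1, "n_distinct inputs: n = %d, N = %.0f, f1 = %d, d = %d",
         numrows, totalrows, f1, d);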

BTW, this is already our second try at this problem; the original 7.2
equation didn't last long at all ...

regards, tom lane
