Tom,
> In general, estimating n-distinct from a sample is just plain a hard
> problem, and it's probably foolish to suppose we'll ever be able to
> do it robustly. What we need is to minimize the impact when we get
> it wrong.
Well, I think it's pretty well proven that to be accurate at all you need
to be able to sample at least 5% of the table, even if some users choose
to sample less. Also, I don't think anyone on this list disputes that the
current algorithm is very inaccurate for large tables. Or do they?
While I don't think that we can estimate N-distinct completely accurately,
I do think that we can get within +/- 5x for 80-90% of all cases, instead
of the 40-50% of cases we manage now. We can't be perfectly accurate, but
we can be *more* accurate.
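
To make "within +/- 5x" concrete, here is a minimal Python sketch of how one
might measure it. It assumes a Haas-Stokes style estimator of the form
n*d / (n - f1 + f1*n/N), which as far as I know is what ANALYZE uses, where
n is the sample size, N the table size, d the distinct values seen in the
sample, and f1 the values seen exactly once. The function names and the
uniform test data are invented for illustration; this is not ANALYZE's code.

```python
import random
from collections import Counter


def estimate_n_distinct(sample, total_rows):
    """Estimate distinct values in a table from a row sample.

    Assumed estimator: n*d / (n - f1 + f1*n/N).
    """
    n = len(sample)
    counts = Counter(sample)
    d = len(counts)                                   # distinct values in the sample
    f1 = sum(1 for c in counts.values() if c == 1)    # values seen exactly once
    return n * d / (n - f1 + f1 * n / total_rows)


def ratio_error(estimate, actual):
    """Symmetric error: 1.0 is exact, 5.0 means off by 5x either way."""
    return max(estimate / actual, actual / estimate)


def within_factor(estimate, actual, factor=5.0):
    """True when the estimate is within +/- factor of the actual value."""
    return ratio_error(estimate, actual) <= factor


def simulate(total_rows, n_values, sample_fraction, seed=0):
    """Build a hypothetical uniform column, sample it, return the ratio error."""
    rng = random.Random(seed)
    table = [rng.randrange(n_values) for _ in range(total_rows)]
    actual = len(set(table))
    k = max(1, int(total_rows * sample_fraction))
    sample = rng.sample(table, k)                     # sampling without replacement
    return ratio_error(estimate_n_distinct(sample, total_rows), actual)
```

Running simulate() over a range of sample fractions and data distributions,
and counting how often within_factor() holds, would be one way to test the
80-90% figure. Uniform data is the easy case; skewed columns are where I'd
expect the small samples to fall apart.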
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco