Re: Improving N-Distinct estimation by ANALYZE

From: Greg Stark <gsstark(at)mit(dot)edu>
To: josh(at)agliodbs(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Improving N-Distinct estimation by ANALYZE
Date: 2006-01-05 00:22:05
Message-ID: 87wthf4j82.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:

> Tom,
>
> > In general, estimating n-distinct from a sample is just plain a hard
> > problem, and it's probably foolish to suppose we'll ever be able to
> > do it robustly. What we need is to minimize the impact when we get
> > it wrong.
>
> Well, I think it's pretty well proven that to be accurate at all you need
> to be able to sample at least 5%, even if some users choose to sample
> less. Also I don't think anyone on this list disputes that the current
> algorithm is very inaccurate for large tables. Or do they?

I think it's worse than that. It's proven that to get any kind of accuracy in
this kind of estimate you basically have to look at the entire table.

Someone posted a URL to a paper that had a very clever way of estimating
distinct values. It required scanning the entire table but only kept a sample
of the values found using a method that guaranteed the sample was
representative not of the entire table but of the distinct values.

I think you're right that a reasonable sample size for this kind of estimate
is going to be proportional to the table size, not the constant sized sample
that regular statistics need. However that same paper has some preliminary
verbiage about how previous papers had found that the sample sizes needed were
unacceptably large. Something like 90%.

Unfortunately I can't find the reference to the paper this moment. I'll look
again later.

I think it would be interesting to get the algorithm for sampling from that
paper in. It would only be able to be used on a VACUUM ANALYZE but currently
VACUUMs are necessary anyways. I do wonder what would happen when we get the
incremental VACUUMs and the bitmaps to avoid vacuuming pages unnecessarily
though. Then it would be less useful.

--
greg

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2006-01-05 00:24:25 Re: Improving N-Distinct estimation by ANALYZE
Previous Message Josh Berkus 2006-01-05 00:07:37 Re: Improving N-Distinct estimation by ANALYZE