Skip site navigation (1) Skip section navigation (2)

Re: Improving N-Distinct estimation by ANALYZE

From: Greg Stark <gsstark(at)mit(dot)edu>
To: josh(at)agliodbs(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Improving N-Distinct estimation by ANALYZE
Date: 2006-01-05 00:22:05
Message-ID: 87wthf4j82.fsf@stark.xeocode.com (view raw or flat)
Thread:
Lists: pgsql-hackers
Josh Berkus <josh(at)agliodbs(dot)com> writes:

> Tom,
> 
> > In general, estimating n-distinct from a sample is just plain a hard
> > problem, and it's probably foolish to suppose we'll ever be able to
> > do it robustly.  What we need is to minimize the impact when we get
> > it wrong.  
> 
> Well, I think it's pretty well proven that to be accurate at all you need 
> to be able to sample at least 5%, even if some users choose to sample 
> less.   Also I don't think anyone on this list disputes that the current 
> algorithm is very inaccurate for large tables.  Or do they?

I think it's worse than that. It's proven that to get any kind of accuracy in
this kind of estimate you basically have to look at the entire table.

Someone posted a URL to a paper that had a very clever way of estimating
distinct values. It required scanning the entire table but only kept a sample
of the values found using a method that guaranteed the sample was
representative not of the entire table but of the distinct values.

I think you're right that a reasonable sample size for this kind of estimate
is going to be proportional to the table size, not the constant sized sample
that regular statistics need. However that same paper has some preliminary
verbiage about how previous papers had found that the sample sizes needed were
unacceptably large. Something like 90%.

Unfortunately I can't find the reference to the paper this moment. I'll look
again later.

I think it would be interesting to get the algorithm for sampling from that
paper in. It would only be able to be used on a VACUUM ANALYZE but currently
VACUUMs are necessary anyways. I do wonder what would happen when we get the
incremental VACUUMs and the bitmaps to avoid vacuuming pages unnecessarily
though. Then it would be less useful.

-- 
greg


In response to

Responses

pgsql-hackers by date

Next:From: Simon RiggsDate: 2006-01-05 00:24:25
Subject: Re: Improving N-Distinct estimation by ANALYZE
Previous:From: Josh BerkusDate: 2006-01-05 00:07:37
Subject: Re: Improving N-Distinct estimation by ANALYZE

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group