Skip site navigation (1) Skip section navigation (2)

Re: Improving N-Distinct estimation by ANALYZE

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org,Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Improving N-Distinct estimation by ANALYZE
Date: 2006-01-05 00:57:01
Message-ID: 1136422621.21025.244.camel@localhost.localdomain (view raw or flat)
Thread:
Lists: pgsql-hackers
On Wed, 2006-01-04 at 19:22 -0500, Greg Stark wrote:
> I think you're right that a reasonable sample size for this kind of
> estimate
> is going to be proportional to the table size, not the constant sized
> sample
> that regular statistics need. 

Agreed [I said exactly that in April]; the counter argument at that time
was that proportional samples on large tables lead to external sorts in
many cases which would lead to unacceptably long run times - since we
need to sort the values for each attribute in turn.

I've proposed limiting ourselves to maintenance_work_mem (I credit Josh
with that idea from April). If you can allocate 1 GB of memory to an
ANALYZE then this would be a very large proportion of a medium sized
table (or partition).

Considering how few rows we sample at the moment, increasing the actual
sample size by many 000s would have a very beneficial effect.

> On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote:
> > This one looks *really* good. 
> > 
> >  http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
> > 
> > It does require a single full table scan 

The Distinct Sampling approach you mention would scan the whole table
and also use large in-memory hash tables. So it uses more I/O, the same
memory and probably less CPU - no sorting required. The technique is the
same implementation as a HashAgg, just one that loses rows in a
predictable manner when it spills out of memory. It doesn't identify
columns that scale with N, nor does it calculate correlation.

Thats the same as re-writing Count(Distinct) to use hashing, which is a
TODO item. So perhaps you could plan the code to do the Distinct
Sampling approach at the same time. Hmmm. I'll think about that.

Best Regards, Simon Riggs


In response to

pgsql-hackers by date

Next:From: Tom LaneDate: 2006-01-05 01:08:18
Subject: Heads up: upcoming back-branch re-releases
Previous:From: Bruce MomjianDate: 2006-01-05 00:39:42
Subject: Re: Stats collector performance improvement

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group