Re: Improving N-Distinct estimation by ANALYZE

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Improving N-Distinct estimation by ANALYZE
Date: 2006-01-05 00:57:01
Message-ID: 1136422621.21025.244.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 2006-01-04 at 19:22 -0500, Greg Stark wrote:
> I think you're right that a reasonable sample size for this kind of
> estimate
> is going to be proportional to the table size, not the constant sized
> sample
> that regular statistics need.

Agreed [I said exactly that in April]; the counter argument at that time
was that proportional samples on large tables lead to external sorts in
many cases which would lead to unacceptably long run times - since we
need to sort the values for each attribute in turn.

I've proposed limiting ourselves to maintenance_work_mem (I credit Josh
with that idea from April). If you can allocate 1 GB of memory to an
ANALYZE then this would be a very large proportion of a medium sized
table (or partition).

Considering how few rows we sample at the moment, increasing the actual
sample size by many 000s would have a very beneficial effect.

> On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote:
> > This one looks *really* good.
> >
> > http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
> >
> > It does require a single full table scan

The Distinct Sampling approach you mention would scan the whole table
and also use large in-memory hash tables. So it uses more I/O, the same
memory and probably less CPU - no sorting required. The technique is the
same implementation as a HashAgg, just one that loses rows in a
predictable manner when it spills out of memory. It doesn't identify
columns that scale with N, nor does it calculate correlation.

Thats the same as re-writing Count(Distinct) to use hashing, which is a
TODO item. So perhaps you could plan the code to do the Distinct
Sampling approach at the same time. Hmmm. I'll think about that.

Best Regards, Simon Riggs

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2006-01-05 01:08:18 Heads up: upcoming back-branch re-releases
Previous Message Bruce Momjian 2006-01-05 00:39:42 Re: Stats collector performance improvement