From: | Simon Riggs <simon(at)2ndquadrant(dot)com> |
---|---|
To: | Greg Stark <gsstark(at)mit(dot)edu> |
Cc: | josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Subject: | Re: Improving N-Distinct estimation by ANALYZE |
Date: | 2006-01-05 00:57:01 |
Message-ID: | 1136422621.21025.244.camel@localhost.localdomain |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wed, 2006-01-04 at 19:22 -0500, Greg Stark wrote:
> I think you're right that a reasonable sample size for this kind of
> estimate
> is going to be proportional to the table size, not the constant sized
> sample
> that regular statistics need.
Agreed [I said exactly that in April]; the counter argument at that time
was that proportional samples on large tables lead to external sorts in
many cases which would lead to unacceptably long run times - since we
need to sort the values for each attribute in turn.
I've proposed limiting ourselves to maintenance_work_mem (I credit Josh
with that idea from April). If you can allocate 1 GB of memory to an
ANALYZE then this would be a very large proportion of a medium sized
table (or partition).
Considering how few rows we sample at the moment, increasing the actual
sample size by many 000s would have a very beneficial effect.
> On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote:
> > This one looks *really* good.
> >
> > http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
> >
> > It does require a single full table scan
The Distinct Sampling approach you mention would scan the whole table
and also use large in-memory hash tables. So it uses more I/O, the same
memory and probably less CPU - no sorting required. The technique is the
same implementation as a HashAgg, just one that loses rows in a
predictable manner when it spills out of memory. It doesn't identify
columns that scale with N, nor does it calculate correlation.
Thats the same as re-writing Count(Distinct) to use hashing, which is a
TODO item. So perhaps you could plan the code to do the Distinct
Sampling approach at the same time. Hmmm. I'll think about that.
Best Regards, Simon Riggs
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2006-01-05 01:08:18 | Heads up: upcoming back-branch re-releases |
Previous Message | Bruce Momjian | 2006-01-05 00:39:42 | Re: Stats collector performance improvement |