Re: Improving N-Distinct estimation by ANALYZE

From: Kenneth Marshall <ktm(at)it(dot)is(dot)rice(dot)edu>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Subject: Re: Improving N-Distinct estimation by ANALYZE
Date: 2006-01-06 23:56:04
Message-ID: 20060106235604.GD7943@it.is.rice.edu
Lists: pgsql-hackers

On Fri, Jan 06, 2006 at 06:36:52PM -0500, Greg Stark wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
>
> > > These numbers don't make much sense to me. It seems like 5% is about as
> > > slow as reading the whole file which is even worse than I expected. I
> > > thought I was being a bit pessimistic to think reading 5% would be as
> > > slow as reading 20% of the table.
> >
> > It's about what *I* expected. Disk seeking is the bane of many access
> > methods.
>
> Sure, but that bad? That means realistic random_page_cost values should be
> something more like 20 rather than 4. And that's with seeks only going to
> subsequent blocks in a single file, which one would expect to average less
> than the half rotation that a random seek would average. That seems worse than
> anyone expects.
>
> > Anyway, since the proof is in the pudding, Simon and I will be working on
> > some demo code for different sampling methods so that we can debate
> > results rather than theory.
>
> Note that if these numbers are realistic then there's no i/o benefit to any
> sampling method that requires anything like 5% of the entire table and is
> still unreliable. Instead it makes more sense to implement an algorithm that
> requires a full table scan and can produce good results more reliably.
>
> --
> greg
>

I have not looked closely at the sampling process in PostgreSQL, but given
that getting an accurate answer is expensive in terms of reads and seeks,
would it be possible to iteratively refine the estimate over time? Ideally
we could use the disk blocks that are already being read to satisfy queries.
The initial estimate could be made with minimal impact, and the I/O needed
to refine it could then piggy-back on normal query I/O. In the limit, once
the entire table had been scanned we would have an exact count of the
distinct values as of that time, and until then a progressively more
accurate approximation as queries run. This would let us hide the
estimation I/O inside block accesses we are doing anyway.
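Just to make the idea concrete: an estimate that can be refined from
whatever blocks happen to get read calls for a mergeable sketch rather than
a page sample. Below is a minimal, hypothetical illustration in C of one
standard technique for this, the K-Minimum-Values (KMV) distinct counter
(Bar-Yossef et al.): hash every value seen, keep only the K smallest
hashes, and estimate ndistinct from how tightly packed they are. This is
not PostgreSQL code, and every name in it (kmv_sketch, kmv_add, KMV_K, and
so on) is made up for the example.

/*
 * Sketch of an incrementally updatable distinct-value estimator using
 * the K-Minimum-Values (KMV) technique.  Hypothetical code, not from
 * the PostgreSQL source tree.  Hash each value, keep the K smallest
 * hashes, and estimate ndistinct as (K-1) / max_kept_hash, with hashes
 * scaled to the unit interval.  Because the sketch depends only on the
 * set of hashes seen, it can be fed opportunistically from whatever
 * blocks a query happens to read.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define KMV_K 256               /* sketch size: larger K => lower error */

typedef struct
{
    uint64_t minhash[KMV_K];    /* K smallest hashes seen, kept sorted */
    int      nkept;             /* how many slots are filled so far */
} kmv_sketch;

/* 64-bit finalizer from MurmurHash3; any good mixing hash will do. */
static uint64_t
hash64(uint64_t x)
{
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

static void
kmv_init(kmv_sketch *s)
{
    s->nkept = 0;
}

/* Feed one value; cheap enough to call from a scan's per-tuple path. */
static void
kmv_add(kmv_sketch *s, uint64_t value)
{
    uint64_t h = hash64(value);
    int      i;

    if (s->nkept == KMV_K && h >= s->minhash[KMV_K - 1])
        return;                 /* not among the K smallest: ignore */

    /* Insertion sort into the small, sorted array, dropping repeats. */
    for (i = 0; i < s->nkept; i++)
    {
        if (h == s->minhash[i])
            return;             /* same hash: almost surely a duplicate */
        if (h < s->minhash[i])
            break;
    }
    if (s->nkept < KMV_K)
        s->nkept++;
    memmove(&s->minhash[i + 1], &s->minhash[i],
            (s->nkept - i - 1) * sizeof(uint64_t));
    s->minhash[i] = h;
}

/* Current estimate of the number of distinct values seen so far. */
static double
kmv_estimate(const kmv_sketch *s)
{
    if (s->nkept < KMV_K)
        return (double) s->nkept;   /* fewer than K seen: count is exact */

    /* Kth smallest hash, as a fraction of the 2^64 hash space. */
    double frac = (double) s->minhash[KMV_K - 1] / 18446744073709551616.0;

    return (KMV_K - 1) / frac;
}

int
main(void)
{
    kmv_sketch s;

    kmv_init(&s);

    /* Simulate piggy-backed reads: 1M rows, 100k distinct values. */
    for (uint64_t row = 0; row < 1000000; row++)
        kmv_add(&s, row % 100000);

    printf("estimated ndistinct = %.0f\n", kmv_estimate(&s));
    return 0;
}

Because the sketch depends only on the set of hashes it has seen, two
sketches built from different scans can be merged by unioning their
minhash arrays and re-truncating to K entries, which is exactly the
property the piggy-backing scheme needs. The relative error is roughly
1/sqrt(K), about 6% for K = 256.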

Ken
