Quick Links

Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

From:	"Dave Held" <dave(dot)held(at)arraysg(dot)com>
To:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
Date:	2005-04-27 14:35:32
Message-ID:	49E94D0CFCD4DB43AFBA928DDD20C8F9026184DF@asg002.asg.local
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

> -----Original Message-----
> From: Greg Stark [mailto:gsstark(at)mit(dot)edu]
> Sent: Wednesday, April 27, 2005 1:00 AM
> To: Tom Lane
> Cc: Rod Taylor; Greg Stark; pgsql-hackers(at)postgresql(dot)org;
> Gurmeet Manku
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
> suggested?
>
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
> > Rod Taylor <pg(at)rbt(dot)ca> writes:
> > > If when we have partitions, that'll be good enough. If
> > > partitions aren't available this would be quite painful
> > > to anyone with large tables -- much as the days of old
> > > used to be painful for ANALYZE.
> >
> > Yeah ... I am very un-enthused about these suggestions to
> > make ANALYZE go back to doing a full scan ...

I don't see why ANALYZE would always have to do a full scan.
Clearly, this statistic would only be useful to people who need
a very accurate n_distinct on tables where the current metric
does not work very well. Applying a specialized solution to
every table doesn't seem like an efficient way to go about
things. Instead, the distinct sampling mechanism should be
purely optional, and probably purely separate from the vanilla
ANALYZE mechanism, because it works differently. If it were
designed that way, the full table scan would be a one-time
cost that would not even need to be paid if the user turned
on this mechanism at table creation. Thereafter, the statistic
would need to be updated incrementally, but that just
distributes the cost of ANALYZE over the INSERT/UPDATE/DELETEs.
Obviously, it's a higher cost because you touch every record
that hits the table, but that's the price you pay for a good
n_distinct.

The block estimator should probably become the default, since
it works within the current ANALYZE paradigm of sampling the
data.

> [...]
> For most use cases users have to run vacuum occasionally. In
> those cases "vacuum analyze" would be no worse than a straight
> normal vacuum.

And that's only if you do a full table scan every time. In the
incremental implementation, there are no lump sum costs involved
except when the statistic is first initialized.

> Note that this algorithm doesn't require storing more data
> because of the large scan or performing large sorts per
> column. It's purely O(n) time and O(1) space.

And I think it should be emphasized that distinct sampling not
only gives you a good n_distinct for query planning, it also
gives you a very fast approximate answer for related aggregate
queries. So you're getting more than just query tuning for that
cost.

> On the other hand, if you have tables you aren't vacuuming
> that means you perform zero updates or deletes. In which case
> some sort of incremental statistics updating would be a good
> solution. A better solution even than sampling.

Which, for the large data warehousing situations where it seems
this mechanism would be most useful, this would probably be the
most common case.

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2005-04-27 15:19:34	Behavior of shared/exclusive row locks
Previous Message	Rod Taylor	2005-04-27 14:34:12	PITR and postmaster.pid