Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

From:	Mischa Sandberg <mischa(dot)sandberg(at)telus(dot)net>
To:	Josh Berkus <josh(at)agliodbs(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-perform <pgsql-performance(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [HACKERS] Bad n_distinct estimation; hacks suggested?
Date:	2005-04-28 15:21:36
Message-ID:	1114701696.4270ff80d577c@webmail.telus.net
Lists:	pgsql-hackers pgsql-performance

Quoting Josh Berkus <josh(at)agliodbs(dot)com>:

> > >Perhaps I can save you some time (yes, I have a degree in Math). If I
> > >understand correctly, you're trying to extrapolate from the correlation
> > >between a tiny sample and a larger sample. Introducing the tiny sample
> > >into any decision can only produce a less accurate result than just
> > >taking the larger sample on its own; GIGO. Whether they are consistent
> > >with one another has no relationship to whether the larger sample
> > >correlates with the whole population. You can think of the tiny sample
> > >like "anecdotal" evidence for wonderdrugs.
>
> Actually, it's more to characterize how large of a sample we need. For
> example, if we sample 0.005 of disk pages, and get an estimate, and then
> sample another 0.005 of disk pages and get an estimate which is not even
> close to the first estimate, then we have an idea that this is a table
> which defies analysis based on small samples. Whereas if the two
> estimates are < 1.0 stdev apart, we can have good confidence that the
> table is easily estimated. Note that this doesn't require progressively
> larger samples; any two samples would work.

We're sort of wandering away from the area where words are a good way
to describe the problem. Lacking a common scratchpad to work with,
could I suggest you talk to someone you consider has a background in
stats, and have them draw for you why this doesn't work?

About all you can get out of it is this: if the two samples are
a stddev apart, then yes, you've demonstrated that the union
of the two populations has a larger stddev than either of them;
but your two stddevs are less info than the stddev of the whole.
Breaking your sample into two (or three, or four, ...) arbitrary pieces
and looking at their stddevs just doesn't tell you any more than what
you start with.
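
Since words are running out, here is a toy sketch of the point in Python.
It is not what ANALYZE does; the function name split_sample_distinct, the
table size, and the two made-up columns are all assumptions chosen for
illustration. It draws two disjoint 0.005 samples from each column and
counts the distinct values seen in each. The two samples agree with each
other almost perfectly on both columns, even though the true n_distinct of
one column is twice that of the other.

```python
import random

def split_sample_distinct(column, frac, seed=0):
    """Draw two disjoint row samples, each `frac` of the column, and
    return the number of distinct values seen in each sample."""
    n = round(len(column) * frac)
    rng = random.Random(seed)
    # Sample 2n row positions without replacement, then split in half,
    # so the two samples never share a row.
    rows = rng.sample(range(len(column)), 2 * n)
    first = {column[r] for r in rows[:n]}
    second = {column[r] for r in rows[n:]}
    return len(first), len(second)

N = 200_000                             # hypothetical table size
all_unique = list(range(N))             # every row has its own value
pairs = [i // 2 for i in range(N)]      # every value occurs exactly twice

results = {}
for name, col in (("all_unique", all_unique), ("pairs", pairs)):
    d1, d2 = split_sample_distinct(col, 0.005)
    true_distinct = len(set(col))
    results[name] = (d1, d2, true_distinct)
    print(f"{name}: sample A={d1}, sample B={d2}, "
          f"true n_distinct={true_distinct}")
```

Agreement between the two samples looks the same for both columns, so it
cannot tell you which column you have. Whatever you can learn, you learn
from the 0.01 sample taken as a whole.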

--
"Dreams come true, not free." -- S.Sondheim, ITW

In response to

Re: [HACKERS] Bad n_distinct estimation; hacks suggested? at 2005-04-27 15:25:16 from Josh Berkus