Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

From: Mischa Sandberg <mischa(dot)sandberg(at)telus(dot)net>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>,pgsql-perform <pgsql-performance(at)postgresql(dot)org>,pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?
Date: 2005-04-28 15:21:36
Message-ID: 1114701696.4270ff80d577c@webmail.telus.net
Lists: pgsql-hackers, pgsql-performance
Quoting Josh Berkus <josh(at)agliodbs(dot)com>:

> > >Perhaps I can save you some time (yes, I have a degree in Math). If I
> > >understand correctly, you're trying to extrapolate from the correlation
> > >between a tiny sample and a larger sample. Introducing the tiny sample
> > >into any decision can only produce a less accurate result than just
> > >taking the larger sample on its own; GIGO. Whether they are consistent
> > >with one another has no relationship to whether the larger sample
> > >correlates with the whole population. You can think of the tiny sample
> > >like "anecdotal" evidence for wonderdrugs.
>
> Actually, it's more to characterize how large of a sample we need.  For
> example, if we sample 0.005 of disk pages, and get an estimate, and then
> sample another 0.005 of disk pages and get an estimate which is not even
> close to the first estimate, then we have an idea that this is a table
> which defies analysis based on small samples.  Whereas if the two
> estimates are < 1.0 stdev apart, we can have good confidence that the
> table is easily estimated.  Note that this doesn't require progressively
> larger samples; any two samples would work.

We're sort of wandering away from the area where words are a good way
to describe the problem. Lacking a common scratchpad to work with,
could I suggest you talk to someone you consider has a background in
stats, and have them draw for you why this doesn't work?

About all you can get out of it is this: if the two samples
differ by a stddev, then yes, you've demonstrated that the union
of the two populations has a larger stddev than either of them;
but your two stddevs are less info than the stddev of the whole.
Breaking your sample into two (or three, or four, ...) arbitrary pieces
and looking at their stddevs just doesn't tell you any more than what
you started with.
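
For anyone who would rather see this than take it on faith, here is a
rough sketch in Python. Everything in it is an assumption made up for
illustration: the synthetic table (10 very common values plus 100,000
values that each occur once), row-level rather than page-level sampling,
and the use of the Haas-Stokes "Duj1" formula n*d / (n - f1 + f1*n/N) as
the n_distinct estimator. It draws two disjoint 0.005 samples, estimates
n_distinct from each, and then from their union.

```python
import random
from collections import Counter

# Hypothetical table, invented for this illustration only.
TOTAL_ROWS = 1_000_000
COMMON_ROWS = 900_000   # these rows cycle through 10 very common values
COMMON_VALUES = 10
TRUE_DISTINCT = COMMON_VALUES + (TOTAL_ROWS - COMMON_ROWS)  # 100,010


def row_value(i):
    """Value stored in row i: a common value, or a value that occurs once."""
    if i < COMMON_ROWS:
        return i % COMMON_VALUES
    return COMMON_VALUES + (i - COMMON_ROWS)


def duj1_estimate(sample, total_rows):
    """Haas-Stokes Duj1 estimate: n*d / (n - f1 + f1*n/N)."""
    n = len(sample)
    counts = Counter(sample)
    d = len(counts)                                   # distinct values seen
    f1 = sum(1 for c in counts.values() if c == 1)    # values seen exactly once
    return n * d / (n - f1 + f1 * n / total_rows)


def run_experiment(seed=0, fraction=0.005):
    """Return (true n_distinct, estimate A, estimate B, estimate from A+B)."""
    rng = random.Random(seed)
    n = round(TOTAL_ROWS * fraction)
    # One draw without replacement, split in half, gives two disjoint samples.
    picked = rng.sample(range(TOTAL_ROWS), 2 * n)
    sample_a = [row_value(i) for i in picked[:n]]
    sample_b = [row_value(i) for i in picked[n:]]
    return (
        TRUE_DISTINCT,
        duj1_estimate(sample_a, TOTAL_ROWS),
        duj1_estimate(sample_b, TOTAL_ROWS),
        duj1_estimate(sample_a + sample_b, TOTAL_ROWS),
    )


if __name__ == "__main__":
    true_d, est_a, est_b, est_ab = run_experiment()
    print(f"true n_distinct:     {true_d}")
    print(f"estimate, sample A:  {est_a:.0f}")
    print(f"estimate, sample B:  {est_b:.0f}")
    print(f"estimate, A + B:     {est_ab:.0f}")
```

On this made-up table the two estimates should land close to each other
and both fall short of the true 100,010 by a couple of orders of
magnitude. So "the two samples agree" tells you nothing about whether
either is right, and pooling them is at least as informative as
comparing them.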

-- 
"Dreams come true, not free." -- S.Sondheim, ITW 

