Re: Make ANALYZE more selective about what is a "most common value"?

From:	Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Marko Tiikkaja <marko(at)joh(dot)to>
Subject:	Re: Make ANALYZE more selective about what is a "most common value"?
Date:	2017-06-11 20:37:32
Message-ID:	CAEZATCV1oE7MW+yH79-=A74DX00ZJMwUq4ke2FN0d-fzAzxfMQ@mail.gmail.com
Lists:	pgsql-hackers

On 11 June 2017 at 20:19, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> The standard way of doing this is to calculate the "standard error" of
>> the sample proportion - see, for example [3], [4]:
>> SE = sqrt(p*(1-p)/n)
>> Note, however, that this formula assumes that the sample size n is
>> small compared to the population size N, which is not necessarily the
>> case. This can be taken into account by applying the "finite
>> population correction" (see, for example [5]), which involves
>> multiplying by an additional factor:
>> SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))
>
> It's been a long time since college statistics, but that wikipedia article
> reminds me that the binomial distribution isn't really the right thing for
> our problem anyway. We're doing sampling without replacement, so that the
> correct model is the hypergeometric distribution.

Yes, that's right.

> The article points out
> that the binomial distribution is a good approximation as long as n << N.
> Can this FPC factor be justified as converting binomial estimates into
> hypergeometric ones, or is it ad hoc?

No, it's not just ad hoc. It is the square root of the ratio of the
variance of the hypergeometric distribution [1] to the variance of a
binomial distribution [2] with p=K/N, in the notation of those articles.
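
To spell that out as a rough sketch: with p = K/N, the hypergeometric
variance is n*p*(1-p)*(N-n)/(N-1) and the binomial variance is n*p*(1-p),
so their ratio is (N-n)/(N-1), and the factor applied to the standard
error is the square root of that. The Python below checks the ratio by
computing the hypergeometric variance directly from its probability mass
function in exact rational arithmetic. The function names and the values
of N, K and n are illustrative assumptions only; none of this is taken
from the ANALYZE code.

```python
from fractions import Fraction
from math import comb, sqrt


def hypergeom_variance(N, K, n):
    """Variance of the number of successes in n draws without replacement
    from a population of N items containing K successes, computed
    directly from the hypergeometric pmf."""
    total = comb(N, n)
    lo, hi = max(0, n - (N - K)), min(n, K)
    pmf = {
        k: Fraction(comb(K, k) * comb(N - K, n - k), total)
        for k in range(lo, hi + 1)
    }
    mean = sum(k * w for k, w in pmf.items())
    return sum((k - mean) ** 2 * w for k, w in pmf.items())


def binomial_variance(n, p):
    """Variance of the number of successes in n draws with replacement."""
    return n * p * (1 - p)


def standard_error(p, n, N=None):
    """SE of a sample proportion; applies the finite population
    correction sqrt((N-n)/(N-1)) when the population size N is given."""
    se = sqrt(p * (1 - p) / n)
    if N is not None:
        se *= sqrt((N - n) / (N - 1))
    return se


# Hypothetical numbers: 1000 rows, 50 of them hold the value of interest,
# and 300 rows are sampled.
N, K, n = 1000, 50, 300
p = Fraction(K, N)

ratio = hypergeom_variance(N, K, n) / binomial_variance(n, p)
assert ratio == Fraction(N - n, N - 1)  # FPC squared, exactly

print(standard_error(float(p), n))      # binomial approximation
print(standard_error(float(p), n, N))   # with finite population correction
```

With n much smaller than N the two printed values nearly coincide, and
when n equals N the corrected standard error drops to zero, since the
whole population has been sampled.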

This is actually a very widely used formula, used in fields like
analysis of survey data, which is inherently sampling without
replacement (assuming the questioners don't survey the same people
more than once!).

Regards,
Dean

[1] https://en.wikipedia.org/wiki/Hypergeometric_distribution
[2] https://en.wikipedia.org/wiki/Binomial_distribution

In response to

Re: Make ANALYZE more selective about what is a "most common value"? at 2017-06-11 19:19:25 from Tom Lane
