Re: Make ANALYZE more selective about what is a "most common value"?

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Marko Tiikkaja <marko(at)joh(dot)to>
Subject: Re: Make ANALYZE more selective about what is a "most common value"?
Date: 2017-06-11 20:37:32
Message-ID: CAEZATCV1oE7MW+yH79-=A74DX00ZJMwUq4ke2FN0d-fzAzxfMQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11 June 2017 at 20:19, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> The standard way of doing this is to calculate the "standard error" of
>> the sample proportion - see, for example [3], [4]:
>> SE = sqrt(p*(1-p)/n)
>> Note, however, that this formula assumes that the sample size n is
>> small compared to the population size N, which is not necessarily the
>> case. This can be taken into account by applying the "finite
>> population correction" (see, for example [5]), which involves
>> multiplying by an additional factor:
>> SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))
>
> It's been a long time since college statistics, but that wikipedia article
> reminds me that the binomial distribution isn't really the right thing for
> our problem anyway. We're doing sampling without replacement, so that the
> correct model is the hypergeometric distribution.

Yes that's right.

> The article points out
> that the binomial distribution is a good approximation as long as n << N.
> Can this FPC factor be justified as converting binomial estimates into
> hypergeometric ones, or is it ad hoc?

No, it's not just ad hoc. It comes from the variance of the
hypergeometric distribution [1] divided by the variance of a binomial
distribution [2] with p=K/N, in the notation of those articles.

This is actually a very widely used formula, used in fields like
analysis of survey data, which is inherently sampling without
replacement (assuming the questioners don't survey the same people
more than once!).

Regards,
Dean

[1] https://en.wikipedia.org/wiki/Hypergeometric_distribution
[2] https://en.wikipedia.org/wiki/Binomial_distribution

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2017-06-11 21:08:10 Re: Buildfarm failures on woodlouse (in ecpg-check)
Previous Message Tom Lane 2017-06-11 19:19:25 Re: Make ANALYZE more selective about what is a "most common value"?