Re: MCV lists for highly skewed distributions

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: John Naylor <jcnaylor(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MCV lists for highly skewed distributions
Date: 2018-03-17 17:16:14
Message-ID: CAEZATCUb5xY-Q9g2hebL35RU4E4SURYd-v7nHrQGAO6xFr+9hg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 13 March 2018 at 08:39, John Naylor <jcnaylor(at)gmail(dot)com> wrote:
>> Also, this is common enough that in fact that distribution
>> can be reasonably approximated by a normal distribution.
>
> For the archives, because it's typically seen 10 times in the sample,
> per the rule of thumb mention upthread?
>

Actually, I think that the rule of thumb of at least 10 instances in
the sample for a normal approximation is probably too strict.

Looking around, values as low as 5 seem to be quite commonly used. Of
course there is no right answer here, it depends on what you're doing
with it, and how close you want the normal distribution to be to the
hypergeometric one. For our purposes, I don't think we really need it
to be that close. We just want to be able to justify statements like
"the value is *probably* more common than X, and it was unlikely that
that was just random sampling error".

>> Making use of
>> the non-MCV-based estimate allows us to do better -- if the MCV-based
>> estimate is statistically significantly higher than the non-MCV-based
>> estimate, then it would almost certainly be better to include that
>> value in the MCV list, even if its spread is quite large.
>
> Because we can treat the sample as normal, testing for > 2 standard
> deviations means that we're "~95% sure" it's more common in the table
> than the non-MCV selectivity would suggest, right? (I realize I'm not
> using technically accurate language)
>

Actually, it's more like 97.5% (remember the normal distribution is
symmetric, so there's around 2.5% below the 2-stddev interval, around
95% in it, and another 2.5% above it), but that's just nit-picking.

There are several nice online calculators for playing around with
hypergeometric distributions, such as
http://keisan.casio.com/exec/system/1180573201

Consider this distribution:

N = 1000000
n = 10000
K = 500
Mean = n * K / N = 5
Variance = (K / N) * (1 - K / N) * n * (N - n) / (N - 1) = 4.9
Stddev = sqrt(Variance) = 2.2

Using the calculator above, you can see that the distribution is
fairly normal-like, but with a noticeable positive skew. The 2-stddev
interval is 0.6 to 9.4, and according to the calculator the
probability of the value being less than or equal to 1 is in the
ballpark of the 2.5% figure expected. So even with just 5 occurrences
in the sample, it's fairly close to a normal distribution.

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Meskes 2018-03-17 17:16:19 Re: ECPG oracle mode test program patch
Previous Message Christos Maris 2018-03-17 16:12:42 Re: Google Summer of Code: Potential Applicant