Re: Make ANALYZE more selective about what is a "most common value"?

From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Cc: marko(at)joh(dot)to
Subject: Re: Make ANALYZE more selective about what is a "most common value"?
Date: 2017-06-11 05:11:15
Message-ID: 80a466cb-de24-24d2-2c75-e14751950979@catalyst.net.nz
Lists: pgsql-hackers

On 06/06/17 10:12, Gavin Flower wrote:

> On 06/06/17 09:41, Mark Kirkwood wrote:
>> On 05/06/17 09:30, Tom Lane wrote:
>>
>>> I've been thinking about the behavior discussed in
>>> https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org
>>>
>>> and it seems to me that there are a couple of things we ought to do
>>> about it.
>>>
>>> First, I think we need a larger hard floor on the number of occurrences
>>> of a value that're required to make ANALYZE decide it is a "most common
>>> value". The existing coding is willing to believe that anything that
>>> appears at least twice in the sample is a potential MCV, but that design
>>> originated when we were envisioning stats samples of just a few thousand
>>> rows --- specifically, default_statistics_target was originally just 10,
>>> leading to a 3000-row sample size. So accepting two-appearance values
>>> as MCVs would lead to a minimum MCV frequency estimate of 1/1500. Now
>>> it could be a tenth or a hundredth of that.
>>>
>>> As a round number, I'm thinking that a good floor would be a frequency
>>> estimate of 1/1000. With today's typical sample size of 30000 rows,
>>> a value would have to appear at least 30 times in the sample to be
>>> believed to be an MCV. That seems like it gives us a reasonable margin
>>> of error against the kind of sampling noise seen in the above-cited
>>> thread.
>>>
>>> Second, the code also has a rule that potential MCVs need to have an
>>> estimated frequency at least 25% larger than what it thinks the
>>> "average" value's frequency is. A rule of that general form seems like
>>> a good idea, but I now think the 25% threshold is far too small to do
>>> anything useful. In particular, in any case like this where there are
>>> more distinct values than there are sample rows, the "average frequency"
>>> estimate will correspond to less than one occurrence in the sample, so
>>> that this rule is totally useless to filter anything that we would
>>> otherwise consider as an MCV. I wonder if we shouldn't make it be
>>> "at least double the estimated average frequency".
>>>
>>
>> Or possibly calculate the sample standard deviation and make use of
>> that to help decide on a more flexible cutoff than twice the avg
>> frequency?
>>
>> Are there any research papers that might help us here (I'm drowning
>> in a sea of barely relevant search results for most phrases I've
>> tried so far)? I recall there were some that Tom referenced when this
>> stuff was originally written.
>>
>> On the other hand I do have access to some mathematicians
>> specializing in statistics - so can get their thoughts on this issue
>> if you feel it would be worthwhile.
>>
>> Cheers
>>
>> Mark
>>
>>
> The standard deviation (sd) is proportional to the square root of the
> sample size in a Normal Distribution.
>
> In a Normal Distribution, about 2/3 of the values will be within plus
> or minus one sd of the mean.
>
> There seems to be an implicit assumption that the distribution of
> values follows the Normal Distribution - has this been verified? I
> suspect that real data will have a skewed distribution of values, and
> may even be multimodal (multiple peaks). The Normal Distribution has
> one central peak with two tails of the same shape & size.
>
> So in a sample of 100 the sd is about 10% of the sample size, while
> for 10,000 it is about 1%.
>
> So essentially, the larger the sample size, the more reliable our
> knowledge of the most common values (ignoring pathologically extreme
> distributions!) - though the measure of reliability depends on the
> distribution.
>
> How about selecting the cutoff as the mean plus one sd, or something
> of that nature? Note that the cutoff may result in no MCVs being
> selected - especially for small samples.
>
> If practicable, it would be good to sample real datasets. Suggest
> looking at datasets where the current mechanism looks reasonable, and
> ones where the estimates are too far off. Also, if possible, try any
> new selection method on those datasets and see what the difference is.
>
> The above is based on what I remember from my university statistics
> papers; I took it up to 4th-year level many moons ago.
>
>
>
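
Just to be sure I'm reading the proposal correctly, here is a rough
sketch of the two rules Tom describes above. This is purely
illustrative - the function and parameter names are invented and it is
not the actual analyze.c code:

#include <stdbool.h>

/*
 * Sketch only: would a value observed "count" times in a sample of
 * "samplerows" rows be kept as an MCV candidate under the proposed rules?
 *
 *  1. Hard floor: its sample frequency must be at least 1/1000.
 *  2. Its sample frequency must be at least double the "average"
 *     value's frequency, taken here as 1/ndistinct.
 */
static bool
candidate_is_mcv(int count, int samplerows, double ndistinct)
{
    double      freq = (double) count / samplerows;
    double      avgfreq = 1.0 / ndistinct;

    if (freq < 0.001)           /* rule 1: frequency floor of 1/1000 */
        return false;
    if (freq < 2.0 * avgfreq)   /* rule 2: at least double the average */
        return false;
    return true;
}

With today's default 30000-row sample, the frequency floor amounts to
requiring at least 30 occurrences, as noted above.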

With respect to the assumption of Normal distribution - it is easy to
come up with an example that shows it need not hold: consider our
pgbench 'accounts' table. The distribution of 'bid' values is not
Normal (it is Uniform).

Now I realize I made people think about Normal distributions by
mentioning computing the standard deviation of the sample (I probably
should have said 'standard deviation of the *sample frequencies*') -
but this was only a suggestion for working out how to (maybe) be less
arbitrary about deciding which values to put in the MCV list (currently
those at least 25% more frequent than the average frequency).
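
For what it is worth, the kind of thing I had in mind is sketched
below - again purely illustrative, with invented names, and assuming we
already have the per-value occurrence counts from the sample:

#include <math.h>

/*
 * Sketch only: derive an MCV count cutoff as
 *     mean + k * sd
 * of the per-value occurrence counts seen in the sample, rather than
 * using a fixed multiple of the average frequency.  track[] holds the
 * number of occurrences of each distinct value found in the sample
 * (track_cnt entries, assumed > 0); k = 1.0 gives "mean plus one sd".
 */
static int
mcv_count_cutoff(const int *track, int track_cnt, double k)
{
    double      sum = 0.0;
    double      sumsq = 0.0;
    double      mean;
    double      var;
    int         i;

    for (i = 0; i < track_cnt; i++)
    {
        sum += track[i];
        sumsq += (double) track[i] * track[i];
    }
    mean = sum / track_cnt;
    var = sumsq / track_cnt - mean * mean;
    if (var < 0.0)              /* guard against rounding error */
        var = 0.0;

    /* values occurring at least this many times would make the MCV list */
    return (int) ceil(mean + k * sqrt(var));
}

As Gavin notes, a cutoff of this form may select no MCVs at all,
especially for small samples, which seems like the right outcome when
nothing clearly stands out; choosing k is exactly the sort of question
I could put to the statisticians.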

regards

Mark
