Re: Index only scan sometimes switches to sequential scan for small amount of rows

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Feike Steenbergen <feikesteenbergen(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Index only scan sometimes switches to sequential scan for small amount of rows
Date: 2015-03-26 07:48:34
Message-ID: CAMkU=1zxyNMN11YL8G7AGF7k5u4ZHVJN0DqCc_ecO1qs49uJgA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Wed, Mar 25, 2015 at 1:00 PM, Feike Steenbergen <
feikesteenbergen(at)gmail(dot)com> wrote:

> On 25 March 2015 at 19:07, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
> > Also, I doubt that that is the problem in the first place. If you
> collect a
> > sample of 30,000 (which the default target size of 100 does), and the
> > frequency of the second most common is really 0.00307333 at the time you
> > sampled it, you would expect to find it 92 times in the sample. The
> chances
> > against actually finding 1 instead of around 92 due to sampling error are
> > astronomical.
>
> It can be that the distribution of values is very volatile; we hope
> the increased stats target (from the default=100 to 1000 for this
> column) and frequent autovacuum and autoanalyze helps in keeping the
> estimates correct.
>
> It seems that it did find some other records (<> 'PRINTED), as is
> demonstrated in the stats where there was only one value in the MCV
> list: the frequency was 0.996567 and the fraction of nulls was 0,
> therefore leaving 0.03+ for other values. But because none of them
> were in the MCV and MCF list, they were all treated as equals. They
> are certainly not equal.
>
> I not know why some values were found (they are mentioned in the
> histogram_bounds), but are not part of the MCV list, as you say, the
> likeliness of only 1 item being found is very small.
>
> Does anyone know the criteria for a value to be included in the MCV list?
>

OK, this is starting to look like a long-standing bug to me.

If it only sees 3 distinct values, and all three are present at least
twice, it throws
all of them into the MCV list. But if one of those 3 were present just
once, then it
tests them to see if they qualify. The test for inclusion is that it has
to be present more than once, and that it must be "over-represented" by 25%.

Lets say it sampled 30000 rows and found 29,900 of one value, 99 of
another, and 1 of a third.

But that turns into the second one needing to be present 12,500 times. The
average value is present 10,000 times (30,000 samples with 3 distinct
values) and 25 more than that is 12,500. So it excluded.

It seems to me that a more reasonable criteria is that it must be
over-represented 25% compared to the average of all the remaining values
not yet accepted into the MCV list. I.e. all the greater ones should be
subtracted out before computing the over-representation threshold.

It is also grossly inconsistent with the other behavior. If they are
"29900; 98; 2" then all three go into the MCV.
If they are "29900; 99; 1" then only the highest one goes in. The second
one gets evicted for being slightly *more* popular.

This is around line 2605 of src/backend/commands/analyze.c in head.

Cheers,

Jeff

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Feike Steenbergen 2015-03-26 08:17:55 Re: Index only scan sometimes switches to sequential scan for small amount of rows
Previous Message Jeff Janes 2015-03-25 21:45:47 Re: Index only scan sometimes switches to sequential scan for small amount of rows