Re: Index only scan sometimes switches to sequential scan for small amount of rows

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Feike Steenbergen <feikesteenbergen(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Index only scan sometimes switches to sequential scan for small amount of rows
Date: 2015-03-25 18:07:12
Message-ID: CAMkU=1wQB8iLh-TvH9rQrSW-a4HgYO2RQRya=tFdAo8esD9qfg@mail.gmail.com
Lists: pgsql-performance

On Wed, Mar 25, 2015 at 9:07 AM, Feike Steenbergen <
feikesteenbergen(at)gmail(dot)com> wrote:

> I'm posting this as I am trying to understand what has happened.
> TLDR: The problem seems to be fixed now.
>
> By bumping the statistics_target we see that most_common_vals is
> being populated more completely, resulting in much better estimates:
>
> attname | status
> inherited | f
> null_frac | 0
> avg_width | 4
> n_distinct | 3
> most_common_vals | {PRINTED,PREPARED,ERROR}
> most_common_freqs | {0.996863,0.00307333,6.33333e-05}
> histogram_bounds | (null)
> correlation | 0.98207
> most_common_elems | (null)
> most_common_elem_freqs | (null)
> elem_count_histogram | (null)
>
> Basically 100% of the records are accounted for in these statistics,
> so the planner now consistently estimates a very small number of rows
> for other values.
>
> Before bumping the target we didn't have information for 0.34% of the
> rows, which in this case means roughly 11K rows.
>
> What is the reasoning behind having at least 2 hits before including
> it in the most_common_* columns?
>

If you sample a small portion of the table, then anything present only
once is going to have huge uncertainty in its estimate.

Consider the consequences of including values sampled just once. Every
value that gets sampled at all appears in the sample at least once, so
most_common_freqs would always sum to 100%. That would amount to
declaring that anything not observed in the sample has a frequency of
0.000000000000%, which is clearly beyond what we have any reasonable
evidence to support.
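
To make that concrete, here is a rough sketch of the estimate the
planner makes for a value that is *not* in most_common_vals (simplified
from the real logic; the numbers below are made up for illustration,
not taken from your table):

    -- The leftover frequency is spread over the remaining distinct
    -- values: suppose the MCV list covers 99.66% of the rows,
    -- n_distinct is 10, and 3 values made it into the MCV list.
    SELECT (1 - 0.9966) / (10 - 3) AS est_freq_of_unlisted_value;
    -- ≈ 0.0005

    -- If singletons were admitted, sum(most_common_freqs) would be ~1,
    -- the leftover would be ~0, and any value not seen in the sample
    -- would be estimated at essentially zero rows.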

Also, I doubt that this is the problem in the first place. If you
collect a sample of 30,000 rows (which the default target size of 100
does), and the frequency of the second most common value really is
0.00307333 at the time you sampled it, you would expect to find it
about 92 times in the sample. The odds of finding it only once instead
of around 92 times due to sampling error alone are astronomical.
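
The 92 above is simply sample size times frequency:

    SELECT 30000 * 0.00307333 AS expected_occurrences;  -- ≈ 92.2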

The problem seems to be rapidly changing stats, not too small a target
size (unless your original target size was far below the current
default value; forgive me if you already reported that, but I didn't
see it anywhere).
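
If you do want a bigger target, you can raise it for just this column
rather than globally. A sketch, with guessed table/column names (adjust
to your schema):

    ALTER TABLE print_queue ALTER COLUMN status SET STATISTICS 1000;
    ANALYZE print_queue;

    -- then inspect what ANALYZE collected:
    SELECT most_common_vals, most_common_freqs, n_distinct
    FROM pg_stats
    WHERE tablename = 'print_queue' AND attname = 'status';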

If you analyze the table at a point when it is 100% PRINTED, there is no
way of knowing based on that analysis alone what the distribution of
!='PRINTED' would be, should such values ever arise.

Maybe it would work better if you built the partial index with WHERE
status = 'NOT_YET_PRINTED' instead of WHERE status != 'PRINTED'.
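
Something along these lines, again with guessed names:

    CREATE INDEX print_queue_unprinted_idx
        ON print_queue (id)
        WHERE status = 'NOT_YET_PRINTED';

    -- rather than the negated predicate:
    -- ... WHERE status != 'PRINTED';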

Cheers,

Jeff
