Re: BUG #14664: Nonsensical join selectivity estimation despite n_distinct

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: marko(at)joh(dot)to, Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #14664: Nonsensical join selectivity estimation despite n_distinct
Date: 2017-06-12 16:20:41
Message-ID: CAMkU=1y6tz1oygO0Z-caDUy1P1msT8ZM4RY9rujOF7wwCvkemg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Sat, Jun 3, 2017 at 1:27 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

>
> If I increase the stats target to 1000, I get results like
>
> regression=# select f, count(*) from (select unnest(most_common_freqs) f
> from pg_stats where tablename = 'qqq' and attname = 'other') ss group by 1
> order by 1;
> f | count
> -------------+-------
> 6.66667e-06 | 935
> 1e-05 | 65
> (2 rows)
>
> regression=# select f, count(*) from (select unnest(most_common_freqs) f
> from pg_stats where tablename = 'qqq' and attname = 'other') ss group by 1
> order by 1;
> f | count
> -------------+-------
> 6.66667e-06 | 936
> 1e-05 | 63
> 1.33333e-05 | 1
> (3 rows)
>
> (the counts move around a bit, but not very much). That corresponds to a
> minimum of 2 and a maximum of either 3 or 4 duplicates in a 300000-row
> sample.
>
> Given that we know that the duplicate values of "other" are adjacent,
> what this is looking like to me is that there's something fishy about
> ANALYZE's choice of sample rows. It seems like it should not be picking 3

or 4 rows from the same block very often, and for all of those to come
> from the same 10-row span (out of 225 rows per page, more or less, in
> this table) seems like it ought to be a really low-probability event.
> But maybe my math is off.
>

Yeah, first it samples 30,000 blocks, then it samples 30,000 rows from
those blocks. This leads to too many blocks getting zero rows sampled, too
many getting more than one sampled, and too few getting exactly one sampled.

Basically, the variance for the poisson distribution for selected rows per
block for sampling 30,000 row from 30,000 blocks is much broader than it is
for sampling 30,000 rows from >> 30,000 blocks, so the 2 stage sampling
method doesn't work very well.

Tomas was working on a patch a while back for fixing this by resampling
from blocks, but I think he dropped it to work on multivariate stuff
instead.

Cheers,

Jeff

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Neil Anderson 2017-06-12 16:24:07 Re: BUG #14701: pg_dump fails to dump pg_catalog schema
Previous Message Tom Lane 2017-06-12 16:18:06 Re: BUG #14701: pg_dump fails to dump pg_catalog schema