Re: More stable query plans via more predictable column statistics

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Stefan Litsche <stefan(dot)litsche(at)zalando(dot)de>
Subject: Re: More stable query plans via more predictable column statistics
Date: 2015-12-04 17:48:33
Message-ID: CA+Tgmobh+u_Km5O5P-_5o9fVWrTK-5Gb17iVvDVvcmkibmtGbg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 1, 2015 at 10:21 AM, Shulgin, Oleksandr
<oleksandr(dot)shulgin(at)zalando(dot)de> wrote:
> Hi Hackers!
>
> This post summarizes a few weeks of research of ANALYZE statistics
> distribution on one of our bigger production databases with some real-world
> data and proposes a patch to rectify some of the oddities observed.
>
>
> Introduction
> ============
>
> We have observed that for certain data sets the distribution of samples
> between most_common_vals and histogram_bounds can be unstable: it may
> change dramatically with the next ANALYZE run, leading to radically
> different plans.
>
> I was revisiting the following performance thread and I've found some
> interesting details about statistics in our environment:
>
>
> http://www.postgresql.org/message-id/flat/CAMkU=1zxyNMN11YL8G7AGF7k5u4ZHVJN0DqCc_ecO1qs49uJgA(at)mail(dot)gmail(dot)com#CAMkU=1zxyNMN11YL8G7AGF7k5u4ZHVJN0DqCc_ecO1qs49uJgA@mail.gmail.com
>
> My initial interest was in evaluating whether the distribution of samples
> could be made more predictable and less dependent on the factor of luck,
> thus leading to more stable execution plans.
>
>
> Unexpected findings
> ===================
>
> What I have found is that in a significant percentage of instances, when a
> duplicate sample value is *not* put into the MCV list, it does produce
> duplicates in the histogram_bounds, so it looks like the MCV cut-off happens
> too early, even though we have enough space for more values in the MCV list.
>
> In the extreme cases I've found completely empty MCV lists and histograms
> full of duplicates at the same time, with only about 20% of distinct values
> in the histogram (as it turns out, this happens due to high fraction of
> NULLs in the sample).

Wow, this is very interesting work. Using values_cnt rather than
samplerows to compute avgcount seems like a clear improvement. It
doesn't make any sense to raise the threshold for creating an MCV
based on the presence of additional nulls or too-wide values in the
table. I bet compute_distinct_stats needs a similar fix. Were it not
for plan stability considerations, I'd say we should back-patch this
all the way; those considerations might militate for a more restrained
approach, though. Still, maybe we should try to sneak at least this
much into 9.5 RSN, because I have to think this is going to help
people with mostly-NULL (or mostly-really-wide) columns.

As far as the rest of the fix, your code seems to remove the handling
for ndistinct < 0. That seems unlikely to be correct, although it's
possible that I am missing something. Aside from that, the rest of
this seems like a policy change, and I'm not totally sure off-hand
whether it's the right policy. Having more MCVs can increase planning
time noticeably, and the point of the existing cutoff is to prevent us
from choosing MCVs that aren't actually "C". I think this change
significantly undermines those protections. It seems to me that it
might be useful to evaluate the effects of this part of the patch
separately from the samplerows -> values_cnt change.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
