Re: More stable query plans via more predictable column statistics

From: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Stefan Litsche <stefan(dot)litsche(at)zalando(dot)de>
Subject: Re: More stable query plans via more predictable column statistics
Date: 2015-12-07 10:19:34
Message-ID: CACACo5TvrNZcB_K2zc2CbcgL4xXGR55re1G4gp9TtqL++WBvCg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Dec 4, 2015 at 6:48 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Dec 1, 2015 at 10:21 AM, Shulgin, Oleksandr
> <oleksandr(dot)shulgin(at)zalando(dot)de> wrote:
> >
> > What I have found is that in a significant percentage of instances, when
> a
> > duplicate sample value is *not* put into the MCV list, it does produce
> > duplicates in the histogram_bounds, so it looks like the MCV cut-off
> happens
> > too early, even though we have enough space for more values in the MCV
> list.
> >
> > In the extreme cases I've found completely empty MCV lists and histograms
> > full of duplicates at the same time, with only about 20% of distinct
> values
> > in the histogram (as it turns out, this happens due to high fraction of
> > NULLs in the sample).
>
> Wow, this is very interesting work. Using values_cnt rather than
> samplerows to compute avgcount seems like a clear improvement. It
> doesn't make any sense to raise the threshold for creating an MCV
> based on the presence of additional nulls or too-wide values in the
> table. I bet compute_distinct_stats needs a similar fix.

Yes, and there's also the magic 1.25 multiplier in that code. I think it
would make sense to agree first on how exactly the patch for
compute_scalar_stats() should look like, then port the relevant bits to
compute_distinct_stats().

> But for
> plan stability considerations, I'd say we should back-patch this all
> the way, but those considerations might mitigate for a more restrained
> approach. Still, maybe we should try to sneak at least this much into
> 9.5 RSN, because I have to think this is going to help people with
> mostly-NULL (or mostly-really-wide) columns.
>

I'm not sure. Likely people would complain or have found this out on their
own if they were seriously affected.

What I would be interested is people running the queries I've shown on
their data to see if there are any interesting/unexpected patterns.

As far as the rest of the fix, your code seems to remove the handling
> for ndistinct < 0. That seems unlikely to be correct, although it's
> possible that I am missing something.

The difference here is that ndistinct at this scope in the original code
did hide a variable from an outer scope. That one could be < 0, but in my
code there is no inner-scope ndistinct, we are referring to the outer scope
variable which cannot be < 0.

> Aside from that, the rest of
> this seems like a policy change, and I'm not totally sure off-hand
> whether it's the right policy. Having more MCVs can increase planning
> time noticeably, and the point of the existing cutoff is to prevent us
> from choosing MCVs that aren't actually "C". I think this change
> significantly undermines those protections. It seems to me that it
> might be useful to evaluate the effects of this part of the patch
> separately from the samplerows -> values_cnt change.
>

Yes, that's why I was wondering if frequency cut-off approach might be
helpful here. I'm going to have a deeper look at array's typanalyze
implementation at the least.

--
Alex

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message YUriy Zhuravlev 2015-12-07 11:29:14 Re: Some questions about the array.
Previous Message Ildus Kurbangaliev 2015-12-07 09:41:41 Re: Support of partial decompression for datums