From: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Stefan Litsche <stefan(dot)litsche(at)zalando(dot)de>
Subject: Re: More stable query plans via more predictable column statistics
Date: 2015-12-07 10:19:34
Message-ID: CACACo5TvrNZcB_K2zc2CbcgL4xXGR55re1G4gp9TtqL++WBvCg@mail.gmail.com
Lists: pgsql-hackers
On Fri, Dec 4, 2015 at 6:48 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Dec 1, 2015 at 10:21 AM, Shulgin, Oleksandr
> <oleksandr(dot)shulgin(at)zalando(dot)de> wrote:
> >
> > What I have found is that in a significant percentage of instances, when a
> > duplicate sample value is *not* put into the MCV list, it does produce
> > duplicates in the histogram_bounds, so it looks like the MCV cut-off
> > happens too early, even though we have enough space for more values in the
> > MCV list.
> >
> > In the extreme cases I've found completely empty MCV lists and histograms
> > full of duplicates at the same time, with only about 20% of distinct
> > values in the histogram (as it turns out, this happens due to a high
> > fraction of NULLs in the sample).
>
> Wow, this is very interesting work. Using values_cnt rather than
> samplerows to compute avgcount seems like a clear improvement. It
> doesn't make any sense to raise the threshold for creating an MCV
> based on the presence of additional nulls or too-wide values in the
> table. I bet compute_distinct_stats needs a similar fix.
Yes, and there's also the magic 1.25 multiplier in that code. I think it
would make sense to agree first on how exactly the patch for
compute_scalar_stats() should look, then port the relevant bits to
compute_distinct_stats().
> But for
> plan stability considerations, I'd say we should back-patch this all
> the way, but those considerations might militate for a more restrained
> approach. Still, maybe we should try to sneak at least this much into
> 9.5 RSN, because I have to think this is going to help people with
> mostly-NULL (or mostly-really-wide) columns.
>
I'm not sure. Likely people would have complained or found this out on their
own if they were seriously affected.

What I would be interested in is people running the queries I've shown on
their own data, to see whether any interesting/unexpected patterns show up.
> As far as the rest of the fix, your code seems to remove the handling
> for ndistinct < 0. That seems unlikely to be correct, although it's
> possible that I am missing something.
The difference here is that, in the original code, ndistinct at this scope
shadowed a variable from an outer scope. That outer variable could be < 0,
but in my code there is no inner-scope ndistinct: we refer only to the
outer-scope variable, which cannot be < 0.
> Aside from that, the rest of
> this seems like a policy change, and I'm not totally sure off-hand
> whether it's the right policy. Having more MCVs can increase planning
> time noticeably, and the point of the existing cutoff is to prevent us
> from choosing MCVs that aren't actually "C". I think this change
> significantly undermines those protections. It seems to me that it
> might be useful to evaluate the effects of this part of the patch
> separately from the samplerows -> values_cnt change.
>
Yes, that's why I was wondering whether a frequency cut-off approach might be
helpful here. I'm going to take a deeper look at the array typanalyze
implementation, at the least.
--
Alex