Re: More stable query plans via more predictable column statistics

From: Alex Shulgin <alex(dot)shulgin(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net>
Subject: Re: More stable query plans via more predictable column statistics
Date: 2016-04-03 23:47:13
Message-ID: CAM-UEKRnHmE-LHao2MqC86bME6-e_-jS0su=HroAkxw4mEJojg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Apr 4, 2016 at 1:06 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Alex Shulgin <alex(dot)shulgin(at)gmail(dot)com> writes:
> > On Sun, Apr 3, 2016 at 10:53 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> The reason for checking toowide_cnt is that if it's greater than zero,
> >> then in fact the track list does NOT include all values seen, and it's
> >> flat-out wrong to claim that it is an exhaustive set of values.
>
> > But do we really state that with the short path?
>
> Well, yes: the point of the short path is that we're hypothesizing that
> the track list contains all values in the table, and that they should all
> be considered MCVs. Another way to think about it is that if we didn't
> have the width-limit implementation restriction, those values would appear
> in the track list, almost certainly with count 1, and so we would not
> have taken the short path anyway.
>
> Now you can argue that the long path would have accepted all the real
> track-list entries as MCVs, and have rejected all these hypothetical
> count-1 entries for too-wide values, and so the end result would be the
> same. But that gets back to the fact that that's not necessarily how
> the long path behaves, either today or with the proposed patch.
>

Agreed.
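
For the archives, here is how I now read the short-path test in
compute_scalar_stats() -- a simplified paraphrase from memory, not the
verbatim analyze.c source, so take the exact names with a grain of salt:

    #include <stdbool.h>

    /*
     * Take every tracked value as an MCV only when the track list is
     * known to be exhaustive: no sampled value was skipped for being
     * too wide, the distinct estimate is a finite (positive) count,
     * and the whole list fits within the stats target.
     */
    static bool
    take_all_tracked_as_mcv(int track_cnt,      /* entries in track list */
                            int ndistinct,      /* distinct values in sample */
                            int toowide_cnt,    /* values skipped as too wide */
                            double stadistinct, /* estimated distinct in table */
                            int num_mcv)        /* stats-target limit */
    {
        return track_cnt == ndistinct &&
               toowide_cnt == 0 &&  /* otherwise the list is NOT exhaustive */
               stadistinct > 0 &&
               track_cnt <= num_mcv;
    }

A single too-wide value flips us to the long path, for exactly the reason
you give above.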

> The design intention was that the short path would handle columns
> with a finite, small set of values (think booleans or enums) where the
> ideal thing is that the table population is completely represented by
> the MCV list. As soon as we have good reason to think that the MCV
> list won't represent the table contents fully, we should switch over
> to a different approach where we're trying to identify which sample
> values are common enough to justify putting in the MCV list.

This is a valuable bit of detail that I unfortunately couldn't find in any
of the sources of information available to me online. :-)

I don't have a habit of hanging out on IRC channels, but now I wonder how
likely it is that I could have learned this just by asking around on
#postgresql (or by mailing you directly, as the committer of this early
implementation--is that OK at all?)

Again, having this type of design decision documented in the code might
save some time and confusion for the sociopath^W introvert type of folks
like myself. ;-)

> In that
> situation there are good reasons to not blindly fill the MCV list all
> the way to the stats-target limit, but to try to cut it off at the
> point of diminishing returns, so that the planner isn't saddled with
> a huge MCV list that doesn't really contain a lot of useful information.
>

This also came to be my understanding at some point.
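
Concretely, the cutoff rule I had reconstructed for myself looks roughly
like this (again a paraphrase, and I'm leaving out the additional cap that
ties the threshold to the histogram bin size):

    /*
     * Long-path cutoff: keep a tracked value as an MCV only if it shows
     * up at least 25% more often than a hypothetical average value, and
     * at least twice.  track_counts[] is sorted by descending count.
     */
    static int
    mcv_cutoff(double samplerows,     /* rows in the sample */
               double ndistinct_est,  /* estimated number of distinct values */
               const int *track_counts,
               int num_mcv)
    {
        double avgcount = samplerows / ndistinct_est;
        double mincount = avgcount * 1.25;
        int    i;

        if (mincount < 2)
            mincount = 2;

        for (i = 0; i < num_mcv; i++)
        {
            if (track_counts[i] < mincount)
                break;          /* everything below is not "common enough" */
        }
        return i;               /* number of entries worth keeping */
    }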

> So that's the logic behind there being two code paths with discontinuous
> behavior. I'm not sure whether we need to try to narrow the discontinuity
> or whether it's OK to act that way and we just need to refine the decision
> rule about which path to take. But anyway, comparisons of frequencies
> of candidate MCVs seem to me to make sense in a large-ndistinct scenario
> (where we have to be selective) but not a small-ndistinct scenario
> (where we should just take 'em all).
>

Yeah, this seems to be an open question, and a totally new one to me in
light of the recent revelations.
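
To put numbers on the discontinuity: at the default statistics target of
100 the sample has 30,000 rows. If the distinct estimate is 1,000, a
typical value occurs 30 times in the sample, so with the 1.25 factor in
the sketch above only values seen at least 38 times survive the cutoff,
and most of the track list is thrown away. For a boolean column the same
machinery would be pointless: there are two values, both are in the track
list, and the short path simply stores them, representing the table
population completely.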

> >> The point of the original logic was to try to decide whether the
> >> values in the sample are significantly more common than typical values
> >> in the whole table population. I think we may have broken that with
> >> 3d3bf62f3: you can't make any such determination if you consider only
> >> what's in the sample without trying to estimate what is not in the
> >> sample.
>
> > Speaking of rabbit holes...
> > I'm out of ideas, unfortunately. We badly need more eyes/brainpower on
> > this, which is why I have submitted a talk proposal on this topic to
> > PGDay.ru this summer in St. Petersburg, fearing that it might be too late
> > to commit a satisfactory version during the current dev cycle for 9.6,
> > and in the hope of drawing at least some attention to it.
>
> If you're thinking it's too late to get more done for 9.6,

Not necessarily, but given the time constraints and some personal issues
that keep popping up, I'm not as optimistic as I was just 24 hours ago.

> I'm inclined to
> revert the aspect of 3d3bf62f3 that made us work from "d" (the observed
> number of distinct values in the sample) rather than stadistinct (the
> extrapolated estimate for the table). On reflection I think that that's
> inconsistent with the theory behind the old MCV-cutoff rule. It wouldn't
> matter if we were going to replace the cutoff rule with something else,
> but it's beginning to sound like that won't happen for 9.6.
>

Please feel free to do whatever you think is in the best interest of the
people preparing 9.6 for the freeze. I'm not all that familiar with the
process, but I guess reverting early might save some head-scratching if
interesting interactions between this change and some others show up.
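
For anyone following along at home, my understanding of the aspect to be
reverted is which denominator feeds that average-frequency cutoff (a
sketch with illustrative names, not the verbatim source):

    #include <stdbool.h>

    /*
     * Denominator of the average-frequency cutoff before and after
     * 3d3bf62f3: the table-level estimate (stadistinct) vs. the number
     * of distinct values actually observed in the sample (d).
     */
    static double
    avg_count_per_value(double samplerows, double totalrows,
                        double stadistinct, /* table-level estimate */
                        double d,           /* distinct seen in sample */
                        bool use_sample_d)  /* true = 3d3bf62f3 behavior */
    {
        if (use_sample_d)
            return samplerows / d;

        /* a negative stadistinct encodes a fraction of the row count */
        if (stadistinct < 0)
            stadistinct = -stadistinct * totalrows;

        return samplerows / stadistinct;
    }

The revert would restore the stadistinct branch, which as I understand it
is what the old "significantly more common than a typical value" theory
assumed.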

Cheers!
--
Alex
