Re: More stable query plans via more predictable column statistics

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alex Shulgin <alex(dot)shulgin(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net>
Subject: Re: More stable query plans via more predictable column statistics
Date: 2016-04-03 23:06:49
Message-ID: 12927.1459724809@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Alex Shulgin <alex(dot)shulgin(at)gmail(dot)com> writes:
> On Sun, Apr 3, 2016 at 10:53 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> The reason for checking toowide_cnt is that if it's greater than zero,
>> then in fact the track list does NOT include all values seen, and it's
>> flat-out wrong to claim that it is an exhaustive set of values.

> But do we really state that with the short path?

Well, yes: the point of the short path is that we're hypothesizing that
the track list contains all values in the table, and that they should all
be considered MCVs. Another way to think about it is that if we didn't
have the width-limit implementation restriction, those values would appear
in the track list, almost certainly with count 1, and so we would not
have taken the short path anyway.

Now you can argue that the long path would have accepted all the real
track-list entries as MCVs, and have rejected all these hypothetical
count-1 entries for too-wide values, and so the end result would be the
same. But that gets back to the fact that that's not necessarily how
the long path behaves, either today or with the proposed patch.

The design intention was that the short path would handle columns
with a finite, small set of values (think booleans or enums) where the
ideal thing is that the table population is completely represented by
the MCV list. As soon as we have good reason to think that the MCV
list won't represent the table contents fully, we should switch over
to a different approach where we're trying to identify which sample
values are common enough to justify putting in the MCV list. In that
situation there are good reasons to not blindly fill the MCV list all
the way to the stats-target limit, but to try to cut it off at the
point of diminishing returns, so that the planner isn't saddled with
a huge MCV list that doesn't really contain a lot of useful information.

So that's the logic behind there being two code paths with discontinuous
behavior. I'm not sure whether we need to try to narrow the discontinuity
or whether it's OK to act that way and we just need to refine the decision
rule about which path to take. But anyway, comparisons of frequencies
of candidate MCVs seem to me to make sense in a large-ndistinct scenario
(where we have to be selective) but not a small-ndistinct scenario
(where we should just take 'em all).

>> The point of the original logic was to try to decide whether the
>> values in the sample are significantly more common than typical values
>> in the whole table population. I think we may have broken that with
>> 3d3bf62f3: you can't make any such determination if you consider only
>> what's in the sample without trying to estimate what is not in the
>> sample.

> Speaking of rabbit holes...
> I'm out of ideas, unfortunately. We badly need more eyes/brainpower on
> this, which is why I have submitted a talk proposal on this topic to
> PGDay.ru this summer in St. Petersburg, fearing that it might be too late
> to commit a satisfactory version during the current dev cycle for 9.6, and
> in hope to draw at least some attention to it.

If you're thinking it's too late to get more done for 9.6, I'm inclined to
revert the aspect of 3d3bf62f3 that made us work from "d" (the observed
number of distinct values in the sample) rather than stadistinct (the
extrapolated estimate for the table). On reflection I think that that's
inconsistent with the theory behind the old MCV-cutoff rule. It wouldn't
matter if we were going to replace the cutoff rule with something else,
but it's beginning to sound like that won't happen for 9.6.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Stark 2016-04-03 23:08:28 Re: Using quicksort for every external sort run
Previous Message Corey Huinker 2016-04-03 22:18:40 Re: psql metaqueries with \gexec