Re: More stable query plans via more predictable column statistics

From: Alex Shulgin <alex(dot)shulgin(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net>
Subject: Re: More stable query plans via more predictable column statistics
Date: 2016-04-03 20:14:38
Message-ID: CAM-UEKS=S0hVRTJXBoYvFfJgK3=sv4YvUDn=FEoktaEnRzK2nw@mail.gmail.com
Lists: pgsql-hackers

On Sun, Apr 3, 2016 at 8:24 AM, Alex Shulgin <alex(dot)shulgin(at)gmail(dot)com> wrote:
>
> On Sun, Apr 3, 2016 at 7:49 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>
>> Alex Shulgin <alex(dot)shulgin(at)gmail(dot)com> writes:
>> > On Sun, Apr 3, 2016 at 7:18 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> >> Well, we have to do *something* with the last (possibly only) value.
>> >> Neither "include always" nor "omit always" seem sane to me. What other
>> >> decision rule do you want there?
>>
>> > Well, what implies that the last value is somehow special? I would think
>> > we should just do with it whatever we do with the rest of the candidate
>> > MCVs.
>>
>> Sure, but both of the proposed decision rules break down when there are no
>> values after the one under consideration. We need to do something sane
>> there.
>
>
> Hm... There is indeed a case where it would be beneficial to have at
> least 2 values in the histogram (to have at least the low/high bounds for
> inequality comparison selectivity) instead of taking both into the MCV
> list or taking one to the MCVs and having to discard the other.

I was thinking about this in the background...

Popularity of the last sample value (when it is not the only one) can be:

a) As high as 50%, in the case of an even split between the only two
values in the sample. Quite obviously, we should take this one into the
MCV list (well, unless the user has specified a statistics_target of 1 for
some bizarre reason, but that should not be our problem).

b) As low as 2/(statistics_target*300), which, with the target set to its
maximum allowed value of 10,000, amounts to 2/(10,000*300) = 1 in
1,500,000. This seems like a really tiny number, but if your table has
some tens of billions of rows, for example, seeing such a value at least
twice means that it might correspond to some thousands of rows in the
table, whereas seeing a value only once might mean just that: it's a unique
value (rough numbers are sketched right after this list).

In this case, putting such a duplicate value in the MCV list will allow a
much better selectivity estimate for equality comparison, as I've mentioned
earlier. It also allows for a better estimate with inequality comparison,
since the MCVs are consulted in that case as well. I see no good reason to
discard such a value.

c) Or anything in between the above figures.
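
To put rough numbers on case b), here is a back-of-the-envelope sketch
(the 20-billion-row table below is made up purely for illustration):

#include <stdio.h>

int
main(void)
{
    /* statistics_target = 10,000 => ANALYZE samples 300 * target rows */
    double  sample_rows = 300.0 * 10000;        /* 3,000,000 sampled rows */
    double  min_dup_freq = 2.0 / sample_rows;   /* seen twice: 1 in 1,500,000 */

    /* a hypothetical table of 20 billion rows */
    double  table_rows = 2e10;

    printf("minimum duplicate frequency: %g\n", min_dup_freq);
    printf("rows it may correspond to:   %.0f\n", min_dup_freq * table_rows);
    /* prints ~13333: thousands of rows, vs. possibly 1 for a value seen once */
    return 0;
}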

In my opinion that amounts to "include always" being the sane option. Do
you see anything else as a problem here?

> Obviously, we need a fresh idea on how to handle this.

On reflection, the case where we have a duplicate value in the track list
which is not followed by any other sample value should be covered by the
short path where we put all the tracked values in the MCV list, so there
should be no need to even consider all of the above!

But the exact short path condition is formulated like this:

if (track_cnt == ndistinct && toowide_cnt == 0 &&
    stats->stadistinct > 0 &&
    track_cnt <= num_mcv)
{
    /* Track list includes all values seen, and all will fit */

So the execution of this path additionally depends on two factors: whether
we've seen at least one too-wide sample value, and whether the distinct
estimation produced a number higher than 10% of the estimated total table
size (which is yet another arbitrary limit, but that's not in scope of this
patch).

These conditions have puzzled me a lot, as I mentioned in the last section
of this thread's opening email, and I could not find anything that would
hint at why they are there: not in the documentation, the code comments, or
the emails on -hackers leading up to the introduction of analyze.c in the
form we know it today. Probably we will never know, unless Tom still has
some notes on this topic from 15 years ago. ;-)

This recalled observation can now also explain why, in the regression
you've seen, the short path was not followed: my bet is that stadistinct
came out negative.
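
(For reference, this is how stadistinct goes negative: analyze.c switches
to storing a fraction of the row count once the estimate crosses the 10%
threshold mentioned above. Quoting the logic roughly from memory, not as
a verbatim hunk:)

    if (stats->stadistinct > 0.1 * totalrows)
        stats->stadistinct = -(stats->stadistinct / totalrows);

So any column whose number of distinct values is estimated to scale with
the table ends up with a negative stadistinct, and the short path above is
skipped.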

Given that we change the logic in the complex path substantially, the
assumptions that led to the "take all MCVs" condition above might no
longer hold, and I see that as a pretty compelling argument to remove the
extra checks, thus keeping only track_cnt == ndistinct. This should also
bring the patch's effect closer to the thread's topic, which is "More
stable query plans".

Regards,
--
Alex
