Re: More stable query plans via more predictable column statistics

From: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More stable query plans via more predictable column statistics
Date: 2016-02-08 14:01:00
Message-ID: CACACo5R_+UZcDPbF38_yG7E7W1zA=Bm9U8BR_wvrLchDELkR7g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 25, 2016 at 5:11 PM, Shulgin, Oleksandr
<oleksandr(dot)shulgin(at)zalando(dot)de> wrote:
>
> On Sat, Jan 23, 2016 at 11:22 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> Overall, I think this is really about deciding when to cut off the MCV
>> list, so that it does not grow needlessly large - as Robert pointed out,
>> the larger the list, the more expensive the estimation (and thus
>> planning).
>>
>> So even if we could fit the whole sample into the MCV list (i.e. we
>> believe we've seen all the values and we can fit them into the MCV
>> list), it may not make sense to do so. The ultimate goal is to estimate
>> conditions, and if we can do that reasonably even after cutting off the
>> least frequent values from the MCV list, then why not?
>>
>> From this point of view, the analysis deals just with the ANALYZE part
>> and does not discuss the estimation counter-part at all.
>
>
> True, this aspect still needs verification. As stated, my primary
> motivation was to improve the plan stability for relatively short MCV
> lists.
>
> Longer MCV lists might be a different story, but see the "Increasing
> stats target" section of the original mail: increasing the target doesn't
> give quite the expected results with unpatched code either.

To address this concern I've run my queries again on the same dataset, now
focusing on how the number of MCV items changes with the patched code
(using the CTEs from my original mail):

WITH ...
SELECT count(1),
       min(num_mcv)::real,
       avg(num_mcv)::real,
       max(num_mcv)::real,
       stddev(num_mcv)::real
  FROM stats2
 WHERE num_mcv IS NOT NULL;
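The CTEs themselves are elided above; see the original mail for the real
definitions. For illustration only - this is a hypothetical reconstruction,
not the original CTE - a stats2 along these lines can be built directly on
pg_stats, with num_mcv being the length of each column's MCV array:

WITH stats2 AS (
    -- hypothetical stand-in for the original CTE: num_mcv is the
    -- number of entries in the column's MCV list
    SELECT array_length(most_common_vals, 1) AS num_mcv
      FROM pg_stats
)
SELECT count(1),
       min(num_mcv)::real,
       avg(num_mcv)::real,
       max(num_mcv)::real,
       stddev(num_mcv)::real
  FROM stats2
 WHERE num_mcv IS NOT NULL;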

(ORIGINAL)
count | 27452
min | 1
avg | 32.7115
max | 100
stddev | 40.6927

(PATCHED)
count | 27527
min | 1
avg | 38.4341
max | 100
stddev | 43.3596

A significant portion of the MCV lists occupies all 100 slots available
with the default statistics target, so it is also interesting to look at
the stats that have "underfilled" MCV lists, by changing the condition of
the WHERE clause to read "num_mcv < 100":
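That is, only the tail of the query above changes:

  FROM stats2
 WHERE num_mcv < 100;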

(<100 ORIGINAL)
count | 20980
min | 1
avg | 11.9541
max | 99
stddev | 18.4132

(<100 PATCHED)
count | 19329
min | 1
avg | 12.3222
max | 99
stddev | 19.6959

As one can see, with the patched code the average length of the MCV lists
doesn't change all that dramatically, while the patch still delivers all
the improvements described in the original mail.

>> After fixing the estimator to consider fraction of NULLs, the estimates
>> look like this:
>>
>>  statistics target | master | patched
>> ------------------------------------------
>>                100 |   1302 |    5356
>>               1000 |   6022 |    6791
>>
>> So this seems to significantly improve the ndistinct estimate (patch
>> attached).
>
>
> Hm... this looks correct. And compute_distinct_stats() needs the same
> treatment, obviously.

I've incorporated this fix into v2 of my patch, as I think it is closely
enough related. I've also added the corresponding changes to
compute_distinct_stats(), which doesn't produce a histogram.
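For readers without the patch at hand: the estimator in question is the
Haas-Stokes (Duj1) formula used by ANALYZE, n*d / (n - f1 + f1*n/N). The
gist of the fix is to feed it non-null row counts for n and N, so that a
large NULL fraction no longer deflates the estimate. A minimal SQL sketch,
for illustration only - the real code is C, in compute_scalar_stats() in
src/backend/commands/analyze.c, and the function name here is made up:

CREATE FUNCTION duj1_ndistinct(
    nonnull_samplerows float8,  -- non-null rows in the sample
    nonnull_totalrows  float8,  -- estimated non-null rows in the table
    d                  float8,  -- distinct values seen in the sample
    f1                 float8)  -- values seen exactly once in the sample
RETURNS float8
LANGUAGE sql IMMUTABLE AS $$
    SELECT (nonnull_samplerows * d) /
           (nonnull_samplerows - f1
            + f1 * nonnull_samplerows / nonnull_totalrows)
$$;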

I'm adding this to the next CommitFest. Further reviews are very much
appreciated!

--
Alex

Attachment Content-Type Size
analyze-better-histogram-v2.patch text/x-patch 7.2 KB
