From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: More stable query plans via more predictable column statistics
Date: 2016-01-23 10:22:08
Message-ID: 56A35450.3020606@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On 01/20/2016 10:49 PM, Alvaro Herrera wrote:
> Tom Lane wrote:
>> "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de> writes:
>>> This post summarizes a few weeks of research of ANALYZE statistics
>>> distribution on one of our bigger production databases with some real-world
>>> data and proposes a patch to rectify some of the oddities observed.
>>
>> Please add this to the 2016-01 commitfest ...
>
> Tom, are you reviewing this for the current commitfest?

While I'm not the right Tom, I've been looking at the patch recently,
so let me post my review here ...

Firstly, I really appreciate the level of detail of the analysis. I
may disagree with some of the conclusions, but I wish all my patch
submissions were of such high quality.

Regarding the patch itself, I think there are a few different points
raised, so let me discuss them one by one:

1) NULLs vs. MCV threshold
--------------------------

I agree that this seems like a bug, and that we should really compute
the threshold only from the non-NULL values. I think the analysis
rather conclusively proves this, and I also think there are other
places where we make the same mistake (more on that at the end of the
review).
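
Just to make that concrete, a minimal sketch of what I have in mind
(the variable names here are only assumptions about the surrounding
code in compute_scalar_stats, not quotes from it):

/* base the MCV cut-off on the non-NULL part of the sample */
avgcount = (double) nonnull_cnt / (double) ndistinct;  /* not samplerows */
mincount = 1.25 * avgcount;
if (mincount < 2)
    mincount = 2;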

2) mincount = 1.25 * avgcount
-----------------------------

While I share the dislike of arbitrary constants (like the 1.25 here),
I do think we'd better keep it - otherwise we might as well get rid of
the mincount entirely, because the most frequent value will always be
at or above the (remaining) average count, making the threshold
pointless.

It might have had an impact in the original code, but in the new one
it's quite useless (see the next point), unless I'm missing some detail.
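
To spell it out with made-up numbers: with sample_cnt = 1000 and
ndistinct = 100 the average count is 10, and track[i] is the most
common of the values not yet moved to the MCV list, so its count is
necessarily >= 10 - a plain "count >= avgcount" check could therefore
never reject it (except through the mincount = 2 floor), while the
1.25 factor at least requires it to be 25% above the average (>= 12.5
here).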

3) modifying avgcount threshold inside the loop
-----------------------------------------------

The comment was extended with this statement:

* We also decrease ndistinct in the process such that going forward
* it refers to the number of distinct values left for the histogram.

and indeed, that's what's happening - at the end of each loop, we do this:

/* Narrow our view of samples left for the histogram */
sample_cnt -= track[i].count;
ndistinct--;

but that immediately lowers the avgcount, as that's re-evaluated within
the same loop:

avgcount = (double) sample_cnt / (double) ndistinct;

which means it's continuously decreasing and lowering the threshold,
although that's partially mitigated by keeping the 1.25 coefficient.
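
For example (made-up numbers): starting with sample_cnt = 10000 and
ndistinct = 100 gives avgcount = 100 and mincount = 125; if the first
tracked value has count 2000, the next iteration already works with
sample_cnt = 8000 and ndistinct = 99, i.e. avgcount ~81 and mincount
~101, so the bar for the second value is noticeably lower than for the
first one.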

This, however, makes reasoning about the algorithm much more complicated.

4) for (i = 0; /* i < num_mcv */; i++)
---------------------------------------

The way the loop is coded seems rather awkward, I think. Not only is
there an unexpected comment in the "for" clause, but the loop body
also starts with this:

/* Another way to say "while (i < num_mcv)" */
if (i >= num_mcv)
    break;

Why not write it as a while loop, then? Possibly including the
(num_hist >= 2) condition, like this:

while ((i < num_mcv) && (num_hist >= 2))
{
    ...
}

In any case, the loop would deserve a comment explaining why we think
computing the thresholds like this makes sense.
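
Something along these lines, perhaps (the comment wording and the
num_hist bookkeeping are only my guesses at how the rest of the patch
fits together):

/*
 * Decide how many of the tracked values to keep as MCVs: a value is
 * kept only if it's significantly (25%) more common than the average
 * of the values not yet assigned to the MCV list, and only while at
 * least two distinct values remain for the histogram.
 */
i = 0;
while ((i < num_mcv) && (num_hist >= 2))
{
    avgcount = (double) sample_cnt / (double) ndistinct;
    mincount = 1.25 * avgcount;
    if (mincount < 2)
        mincount = 2;

    if (track[i].count < mincount)
        break;

    /* Narrow our view of samples left for the histogram */
    sample_cnt -= track[i].count;
    ndistinct--;
    num_hist = ndistinct;    /* possibly capped at num_bins + 1 */

    i++;
}
num_mcv = i;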

Summary
-------

Overall, I think this is really about deciding when to cut off the MCV
list, so that it does not grow needlessly large - as Robert pointed
out, the larger the list, the more expensive the estimation (and thus
planning).

So even if we could fit the whole sample into the MCV list (i.e. we
believe we've seen all the values and we can fit them into the MCV
list), it may not make sense to do so. The ultimate goal is to estimate
the selectivity of conditions, and if we can do that reasonably well
even after cutting off the least frequent values from the MCV list,
then why not?

From this point of view, the analysis deals just with the ANALYZE part
and does not discuss the estimation counterpart at all.

5) ndistinct estimation vs. NULLs
---------------------------------

While looking at the patch, I started wondering whether we're actually
handling NULLs correctly when estimating ndistinct, because that part
also uses samplerows directly and entirely ignores NULLs - it does this:

numer = (double) samplerows * (double) d;

denom = (double) (samplerows - f1) +
    (double) f1 * (double) samplerows / totalrows;

...
if (stadistinct > totalrows)
    stadistinct = totalrows;

For tables with a large fraction of NULLs, this seems to significantly
underestimate the ndistinct value - for example, consider a trivial
table with 95% NULL values and ~10k distinct values with a skewed
distribution:

create table t (id int);

insert into t
select (case when random() < 0.05 then (10000 * random() * random())
             else null end) from generate_series(1,1000000) s(i);

In practice, there are 8325 distinct values in my sample:

test=# select count(distinct id) from t;
 count
-------
  8325
(1 row)

But after ANALYZE with default statistics target (100), ndistinct is
estimated to be ~1300, and with target=1000 the estimate increases to ~6000.

After fixing the estimator to consider fraction of NULLs, the estimates
look like this:

 statistics target | master | patched
-------------------+--------+---------
               100 |   1302 |    5356
              1000 |   6022 |    6791

So this seems to significantly improve the ndistinct estimate (patch
attached).
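
For illustration, the idea is roughly this (just a sketch of the
intent - the attached patch is the actual change, and null_cnt /
stanullfrac are assumptions about the surrounding code):

/* use only the non-NULL part of the sample and of the table */
double nonnull_samplerows = (double) (samplerows - null_cnt);
double nonnull_totalrows = totalrows * (1.0 - stats->stanullfrac);

numer = nonnull_samplerows * (double) d;

denom = (nonnull_samplerows - f1) +
    (double) f1 * nonnull_samplerows / nonnull_totalrows;

stadistinct = numer / denom;

/* and clamp to the non-NULL part of the table, not to totalrows */
if (stadistinct > nonnull_totalrows)
    stadistinct = nonnull_totalrows;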

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
ndistinct-nulls-fix.patch application/x-patch 1.0 KB
