Re: MCV lists for highly skewed distributions

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: John Naylor <jcnaylor(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MCV lists for highly skewed distributions
Date: 2018-03-18 22:52:30
Message-ID: CAEZATCWs_DdJb+4fppwqcExx7j+X2uP_GyPoymLvsaeNH1mW6Q@mail.gmail.com
Lists: pgsql-hackers

On 18 March 2018 at 12:24, John Naylor <jcnaylor(at)gmail(dot)com> wrote:
> Over the weekend I tried out a test to measure:
> -The size of the MCV list
> -The ratio between actual and estimated cardinality of the least
> common value in the MCV list.
> -The ratio between actual and estimated cardinality of the most common
> value not in the MCV list.

Nice tests!

> Summary for non-uniform distributions:
> Note: There's a lot of variation in the estimation of the most common
> non-MCV. About 1% of all ANALYZE runs apparently failed to sample one
> of the good candidates for the MCV list, resulting in huge
> underestimates. It didn't matter what patch was applied. For future
> tests, I'll do more runs and measure a truncated mean. Looking at the
> number of MCVs is much more stable, but unlike the uniform
> distribution case, we don't have an intuitive feel for what that
> number should be. That said,
>
> -The V2 patch is somewhat better than RSE and much better than master
> at estimating the most common value not in the MCV list.

Yes, that matches my observations too. It stems from the fact that the
V2 patch tends to produce more MCVs than the RSE patch (which in turn
tends to produce more MCVs than master) for non-uniform distributions.
More MCVs then improve the estimates for non-MCVs.
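To spell out why: the planner estimates the selectivity of a value that is not in the MCV list by spreading the leftover frequency mass evenly over the remaining distinct values, so every heavy hitter that makes it into the MCV list shrinks the per-value estimate for everything else. A rough sketch of that calculation (the function and variable names here are mine, a simplification of what the planner does, not its actual code):

```python
# Sketch of the non-MCV selectivity estimate: the frequency mass left
# over after the MCVs (and nulls) is divided evenly among the distinct
# values that are not in the list.
def non_mcv_selectivity(mcv_freqs, ndistinct, nullfrac=0.0):
    remaining_freq = 1.0 - sum(mcv_freqs) - nullfrac
    remaining_distinct = ndistinct - len(mcv_freqs)
    return remaining_freq / remaining_distinct

# Skewed column, 1000 distinct values, a handful dominating.  A longer
# MCV list removes more of the heavy hitters from the "everything else"
# bucket, so the estimate for the genuinely rare values drops.
short_list = [0.30, 0.20]                     # 2 MCVs covering 50%
longer_list = [0.30, 0.20, 0.10, 0.05, 0.05]  # 5 MCVs covering 70%

assert non_mcv_selectivity(longer_list, 1000) < non_mcv_selectivity(short_list, 1000)
```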

> -Bumping the sample stddev threshold to 3 seems to behave similarly to
> the RSE patch, maybe slightly better.

In a way, that's probably not surprising. The two algorithms are
closely related. If the non-MCV frequencies are quite low, the v2
patch with a 3-SD threshold behaves like the RSE patch with a 33% RSE
cutoff. The 20% RSE cutoff patch is mathematically equivalent to the
v2 patch with a 5-SD threshold and the non-MCV frequencies all set to
zero, if that makes sense.
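To see the equivalence: when the expected non-MCV count is negligible, the standard error of a sample count c is roughly sqrt(c), so "c is more than K standard errors above the non-MCV expectation" collapses to c > K*sqrt(c), i.e. c > K^2, while the relative standard error 1/sqrt(c) drops below 1/K at exactly the same point. A quick numeric check (a deliberate simplification of both patches in that limit, not the actual ANALYZE code):

```python
import math

def passes_sd_test(count, K):
    # v2-style test with the non-MCV expectation taken as ~zero: the
    # sample count must exceed it by K standard errors, where the
    # standard error of a rare value's count is ~sqrt(count).
    return count > K * math.sqrt(count)

def passes_rse_test(count, rse_cutoff):
    # RSE-style test: the relative standard error of the count,
    # ~1/sqrt(count) for a rare value, must fall below the cutoff.
    return 1.0 / math.sqrt(count) < rse_cutoff

# 3 SDs <-> 33% RSE and 5 SDs <-> 20% RSE: the two tests agree at
# every sample count once the non-MCV frequencies are ~0.
for c in range(1, 100):
    assert passes_sd_test(c, 3) == passes_rse_test(c, 1 / 3)
    assert passes_sd_test(c, 5) == passes_rse_test(c, 1 / 5)
```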

> Summary for uniform distributions:
> Note 1: Using the average is not really adequate for testing
> under/overestimation of values in my test setup, since I set the
> estimation ratio to zero if there was either an empty MCV list or a
> completely full list. In future runs, I'll drill down a bit more
> precisely.
> That said, there didn't seem to be a huge amount of variation between
> patches as far as either of the ratios I was testing. Looking at the
> number of MCVs is much better anyway, since we know we want either
> none or everything.
>
> Note 2: All patches fully populated the list when all the values fit
> in the MCV list. For the rest of the cases:
>
> -The RSE patch is not much better than master at excluding MCVs.

To make sense of that, you need to look at the errors in the
estimates. The RSE patch won't exclude MCVs if it thinks they will
give reasonably accurate estimates, so may produce fully populated MCV
lists for uniform distributions even when not all of the distinct
values fit. That's not ideal, but it isn't necessarily bad. In my
previous testing I found that it was good at excluding MCVs in cases
where they gave inaccurate estimates.

> -The V2 patch is much better than either of these, but still not ideal
> (up to 30)
> -The V2 patch with a cutoff of 2.5 severely cut down the MCVs (at most 3).
> -With a cutoff of 3.0, all are empty.

Makes sense, but I don't think we should necessarily put too much
emphasis on the uniform tests. Real-world datasets tend to be
non-uniform, and the consequences of too few MCVs tend to be worse
than those of too many.

I repeated these tests with a 2-SD continuity-corrected threshold and
the results fell somewhere between the 2-SD and 2.5-SD cases. My
inclination is to go with that, as something that strikes a reasonable
balance, and represents a distinct improvement over master in a number
of different areas.
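In case it helps anyone replicate the tests, the rule I'm describing amounts to something like the following: keep a candidate MCV only if its sample count exceeds the expected count for an average non-MCV value by 0.5 (the continuity correction) plus two standard errors, using the hypergeometric variance for sampling without replacement. This is a simplified sketch with made-up example numbers, not the patch itself; the real code also has to deal with nulls and the ndistinct estimate:

```python
import math

def keep_as_mcv(count, samplerows, totalrows, nonmcv_distinct, nonmcv_freq):
    """Sketch: keep a candidate MCV only if its sample count is more
    than 2 standard errors (plus a 0.5 continuity correction) above
    what an average non-MCV value would be expected to get."""
    # Expected table count and sample count for an average non-MCV value.
    avg_table_count = totalrows * nonmcv_freq / nonmcv_distinct
    expected = samplerows * avg_table_count / totalrows
    # Hypergeometric variance of that value's count in the sample.
    variance = (samplerows * avg_table_count
                * (totalrows - avg_table_count)
                * (totalrows - samplerows)
                / (totalrows * totalrows * (totalrows - 1)))
    return count > expected + 0.5 + 2.0 * math.sqrt(variance)

# Example (invented numbers): 30000 sample rows out of 1M, with 50% of
# the rows spread over 100 non-MCV values, i.e. ~150 expected per value.
assert keep_as_mcv(200, 30000, 1000000, 100, 0.5)       # clearly above
assert not keep_as_mcv(160, 30000, 1000000, 100, 0.5)   # within noise
```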

I'll post something tomorrow, once I've finished wordsmithing the comments.

Regards,
Dean
