Re: MCV lists for highly skewed distributions

From: John Naylor <jcnaylor(at)gmail(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MCV lists for highly skewed distributions
Date: 2018-03-18 12:24:30
Message-ID: CAJVSVGU5s60-pK7MR9fO_ux+KPortpQRCdFBpFw4ExRm2B+pBA@mail.gmail.com
Lists: pgsql-hackers

I wrote:

> Looks good. I'll run some tests of my own this week, trying to find
> corner cases different from yours.

Over the weekend I tried out a test to measure:
-The size of the MCV list
-The ratio between actual and estimated cardinality of the least
common value in the MCV list.
-The ratio between actual and estimated cardinality of the most common
value not in the MCV list.

The script, results, and some starter queries are attached. I ran a
bunch more queries to drill down in certain areas, but they're all
based on the ones here.
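
For concreteness, here's a rough sketch (not the attached script
itself) of how the first two of those numbers can be pulled out of
pg_stats; the database name, table "t", and column "v" are
placeholders:

# Sketch only: MCV list size and actual/estimated ratio for the least
# common value that made the MCV list. Names are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=mcvtest")
conn.autocommit = True
cur = conn.cursor()

cur.execute("""
    SELECT s.most_common_vals::text::int[],  -- anyarray needs a text hop
           s.most_common_freqs,
           c.reltuples
      FROM pg_stats s
      JOIN pg_class c ON c.relname = s.tablename
     WHERE s.tablename = 't' AND s.attname = 'v'
""")
mcvs, freqs, reltuples = cur.fetchone()

print("MCV list size:", len(mcvs))

# Entries are ordered by descending frequency, so the last one is the
# least common value still in the list.
least_val, least_freq = mcvs[-1], freqs[-1]
cur.execute("SELECT count(*) FROM t WHERE v = %s", (least_val,))
actual = cur.fetchone()[0]
print("actual/estimated, least common MCV:", actual / (least_freq * reltuples))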

I ran against master, the RSE patch, and 3 different variations of the
V2 patch (with stddev cutoffs of 2.0, 2.5, and 3.0). For each data
file, I loaded it into each DB, ran ANALYZE 10 times, and took the
average of each of the three metrics above. I only tested with the
default stats target.
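
The per-file loop was shaped roughly like this (simplified;
collect_metrics() is a hypothetical stand-in for the pg_stats query
sketched above, returning the three numbers):

# Simplified shape of the measurement loop. collect_metrics() is a
# hypothetical wrapper around the pg_stats query sketched earlier.
N_RUNS = 10
totals = [0.0, 0.0, 0.0]  # MCV count, in-list ratio, out-of-list ratio

for _ in range(N_RUNS):
    cur.execute("ANALYZE t")  # re-sample, then re-read the stats
    for i, metric in enumerate(collect_metrics(cur)):
        totals[i] += metric

n_mcvs, ratio_in, ratio_out = (t / N_RUNS for t in totals)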

For this test, I used the same distributions as Dean's original
script, but I whacked around the script to enable inserting results
into a table.

--
Summary for non-uniform distributions:
Note: There's a lot of variation in the estimation of the most common
non-MCV. About 1% of all ANALYZE runs apparently failed to sample one
of the good candidates for the MCV list, resulting in huge
underestimates. It didn't matter which patch was applied. For future
tests, I'll do more runs and measure a truncated mean (sketched after
the list below). The number of MCVs is a much more stable metric, but
unlike the uniform distribution case, we don't have an intuitive feel
for what that number should be. That said:

-The V2 patch is somewhat better than RSE and much better than master
at estimating the most common value not in the MCV list.
-Bumping the sample stddev threshold to 3 seems to behave similarly to
the RSE patch, maybe slightly better.
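
By "truncated mean" above I mean the standard trimmed mean: drop a
fixed fraction from each tail before averaging, so the ~1% of wild
underestimates can't dominate. Something like:

# Trim a fixed fraction from each tail, then average what's left.
# (scipy.stats.trim_mean does the same thing.)
def trimmed_mean(xs, trim_frac=0.05):
    xs = sorted(xs)
    k = int(len(xs) * trim_frac)
    kept = xs[k:len(xs) - k] if k else xs
    return sum(kept) / len(kept)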

--
Summary for uniform distributions:
Note 1: Using the average is not really adequate for testing
under/overestimation of values in my test setup, since I set the
estimation ratio to zero if there was either an empty MCV list or a
completely full list. In future runs, I'll drill down a bit more
precisely.
That said, there didn't seem to be a huge amount of variation between
the patches in either of the ratios I was testing. The number of MCVs
is a much better metric anyway, since we know we want either none of
the values in the list or all of them.

Note 2: All patches fully populated the list when all the values fit
in the MCV list. For the rest of the cases:

-The RSE patch is not much better than master at excluding MCVs.
-The V2 patch is much better than either of those, but still not ideal
(up to 30 entries).
-The V2 patch with a cutoff of 2.5 severely cut down the MCVs (at most 3).
-With a cutoff of 3.0, all lists are empty.

--
Here's one interesting case. It's a normal distribution, but highly
dispersed (N=500000, sd=1000), so it resembles a uniform one. The
number of MCVs jumps around a lot between patches, but the estimates
don't vary much:

Master: 100
RSE: 1
V2: 100
V2 with 2.5 cutoff: 100
V2 with 3.0 cutoff: 32
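
If anyone wants to eyeball this case, the data is easy to regenerate,
along these lines (I'm assuming integer rounding here so that values
repeat; the details may differ from Dean's script):

# Approximate regeneration: 500000 draws from a normal with sd=1000,
# rounded to integers so equal values can repeat. The exact recipe in
# Dean's script may differ; this is just the shape.
import random

rows = [round(random.gauss(0, 1000)) for _ in range(500000)]
# Load into a one-column table, ANALYZE, then check
# array_length(most_common_vals, 1) in pg_stats for that column.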

--
Thoughts and future testing:
I think the V2 patch strikes a good balance.
I'm also partial to the V2 patch with the 2.5 cutoff in sample
standard deviation, since it's much more aggressive about excluding
values for uniform distributions. It's almost as good as plain V2 at
including values in non-uniform distributions, but possibly worse by
enough that it's not the best choice.

Time is running out, but some things I'd like to look at:

-As mentioned above, better noise reduction in my data analysis.
-The continuity-corrected Wald interval Dean mentioned [1] (the
textbook formula is sketched after this list)
-I wonder if we can actually untie the max size of the MCV list from
the stats target and use a hard-coded maximum like 1000. That might
allow us to bump the max stats target without hurting planning time.
Perhaps next cycle.
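
For reference, the textbook continuity-corrected Wald interval for a
proportion count/n is the following; how it would plug into the MCV
cutoff is Dean's proposal in [1], not something I'm showing here:

# Continuity-corrected Wald interval for a proportion p = count/n:
# p +/- (z*sqrt(p*(1-p)/n) + 1/(2n)), clamped to [0, 1].
import math

def wald_cc(count, n, z=1.96):
    p = count / n
    half = z * math.sqrt(p * (1 - p) / n) + 0.5 / n
    return max(0.0, p - half), min(1.0, p + half)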

-John Naylor

[1] https://www.postgresql.org/message-id/CAEZATCVHEEg%2BCrP%2B-0JUsVeNPu5rs_S23oJVeH4VF%3DfgDwhfLQ%40mail.gmail.com

Attachment                         Content-Type        Size
mcv-tests-JCN-v1.py                text/x-python       14.2 KB
MCV_queries.sql                    application/sql     4.0 KB
mcv_results_JCN_0317_full.csv.gz   application/x-gzip  42.5 KB
