Re: MCV lists for highly skewed distributions

From: John Naylor <jcnaylor(at)gmail(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MCV lists for highly skewed distributions
Date: 2018-03-19 16:59:17
Message-ID: CAJVSVGUOw4DWVjTK1eb5jOiYVQK8s9m8MejqpdPoRDmwk9icUg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/19/18, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
> As promised, here is a new patch, with comment updates, per John and
> Tomas' suggestions, plus the continuity correction, which seemed just
> about worthwhile.

Great. I'm happy with the behavior of the patch. I've marked it ready
for committer.

> I repeated these tests with a 2-SD continuity-corrected threshold and
> the results fell somewhere between the 2-SD and 2.5-SD cases. My
> inclination is to go with that, as something that strikes a reasonable
> balance, and represents a distinct improvement over master in a number
> of different areas.

I also ran some tests with a hand-edited continuity correction last
night, but just now got around to looking at the results (queries
attached). I ran largely as before, but did 20 analyze runs with each
file instead of 10, using the V2 patch, the CC patch, and the V2 patch
with 2.5 threshold. I took out a bunch of the uniform tests to save
time since they largely behave the same.

-----------------
Non-uniform:

To reduce noise for analysis, I tried filtering out the least and
greatest distinct value before taking the average for the
underestimation ratio for most common non-MCV. I also removed results
where all 3 patches had a full MCV list. There was still enough noise
that it's impossible to discern differences between patches that are
very similar. It's not like against HEAD where there were clear
differences in this test, so I won't post those numbers.

Looking at the number of MCVs, it's a little clearer. With few
exceptions, the number either stays the same or decreases slightly
going left to right:

title | v20_num_mcvs |
cc05_num_mcvs | v25_num_mcvs
------------------------------------------+--------------+---------------+--------------
Exp. dist. (N=500k, sd=0.25 (beta)) | 3 |
3 | 3
Exp. dist. (N=500k, sd=0.50 (beta)) | 5 |
6 | 5
Exp. dist. (N=500k, sd=1.00 (beta)) | 9 |
9 | 9
Exp. dist. (N=500k, sd=2.00 (beta)) | 15 |
15 | 15
Exp. dist. (N=500k, sd=3.00 (beta)) | 22 |
21 | 21
Exp. dist. (N=500k, sd=4.00 (beta)) | 27 |
27 | 26
Exp. dist. (N=500k, sd=5.00 (beta)) | 33 |
32 | 30
Exp. dist. (N=500k, sd=10.00 (beta)) | 60 |
58 | 56
Exp. dist. (N=500k, sd=20.00 (beta)) | 100 |
99 | 97
Laplace dist. (N=500k, b=0.25 (lambda)) | 5 |
6 | 5
Laplace dist. (N=500k, b=0.50 (lambda)) | 9 |
8 | 7
Laplace dist. (N=500k, b=1.00 (lambda)) | 15 |
15 | 15
Laplace dist. (N=500k, b=2.00 (lambda)) | 27 |
26 | 26
Laplace dist. (N=500k, b=3.00 (lambda)) | 38 |
37 | 36
Laplace dist. (N=500k, b=4.00 (lambda)) | 48 |
47 | 45
Laplace dist. (N=500k, b=5.00 (lambda)) | 58 |
57 | 55
Laplace dist. (N=500k, b=10.00 (lambda)) | 100 |
99 | 97
MNM (N=2000, peaks=20, sep=20, sd=15) | 100 |
100 | 62
Normal dist. (N=500k, sd=1) | 7 |
7 | 7
Normal dist. (N=500k, sd=2) | 15 |
15 | 14
Normal dist. (N=500k, sd=3) | 21 |
21 | 20
Normal dist. (N=500k, sd=4) | 28 |
26 | 26
Normal dist. (N=500k, sd=5) | 33 |
33 | 33
Normal dist. (N=500k, sd=6) | 39 |
38 | 38
Normal dist. (N=500k, sd=7) | 45 |
45 | 44
Normal dist. (N=500k, sd=8) | 51 |
49 | 49
Normal dist. (N=500k, sd=9) | 56 |
56 | 54
Normal dist. (N=500k, sd=10) | 63 |
62 | 60
Normal dist. (N=500k, sd=15) | 89 |
89 | 86
Pareto dist. (N=500, a=1.00) | 70 |
65 | 56
Pareto dist. (N=500, a=2.00) | 20 |
19 | 17
Pareto dist. (N=500, a=3.00) | 10 |
9 | 9
Pareto dist. (N=500, a=4.00) | 6 |
6 | 6
Pareto dist. (N=500, a=5.00) | 5 |
5 | 4
(34 rows)

-----------------
Uniform:

Ideally, we want an empty MCV list unless we're sampling a large
portion of the table. As Dean mentioned, this is not as important as
getting the non-uniform case right. That said, it's nice to be
confident we can keep out flotsam. The number generally gets smaller
as we go left to right.

title | v20_num_mcvs | cc05_num_mcvs
| v25_num_mcvs
---------------------------------------+--------------+---------------+--------------
Corr. unif. dist. (N=1000k, Nd=1000) | 13 | 9
| 2
Corr. unif. dist. (N=1000k, Nd=10000) | 33 | 8
| 0
Corr. unif. dist. (N=1000k, Nd=200) | 4 | 3
| 1
Corr. unif. dist. (N=1000k, Nd=2000) | 21 | 10
| 1
Corr. unif. dist. (N=1000k, Nd=500) | 9 | 8
| 1
Corr. unif. dist. (N=1000k, Nd=5000) | 15 | 15
| 2
Corr. unif. dist. (N=100k, Nd=1000) | 12 | 7
| 3
Corr. unif. dist. (N=100k, Nd=10000) | 15 | 2
| 0
Corr. unif. dist. (N=100k, Nd=200) | 4 | 3
| 1
Corr. unif. dist. (N=100k, Nd=2000) | 24 | 11
| 2
Corr. unif. dist. (N=100k, Nd=500) | 6 | 7
| 2
Corr. unif. dist. (N=100k, Nd=5000) | 25 | 6
| 1
Corr. unif. dist. (N=30k, Nd=150) | 100 | 100
| 100
Corr. unif. dist. (N=60k, Nd=1000) | 14 | 6
| 1
Corr. unif. dist. (N=60k, Nd=10000) | 0 | 0
| 0
Corr. unif. dist. (N=60k, Nd=200) | 4 | 4
| 1
Corr. unif. dist. (N=60k, Nd=2000) | 16 | 5
| 1
Corr. unif. dist. (N=60k, Nd=500) | 9 | 5
| 1
Corr. unif. dist. (N=60k, Nd=5000) | 16 | 1
| 0
Ran. unif. dist. (N=1000k, Nd=1000) | 16 | 9
| 2
Ran. unif. dist. (N=1000k, Nd=10000) | 33 | 9
| 1
Ran. unif. dist. (N=1000k, Nd=200) | 4 | 4
| 1
Ran. unif. dist. (N=1000k, Nd=2000) | 19 | 11
| 2
Ran. unif. dist. (N=1000k, Nd=500) | 9 | 7
| 1
Ran. unif. dist. (N=1000k, Nd=5000) | 15 | 15
| 2
Ran. unif. dist. (N=100k, Nd=1000) | 12 | 8
| 3
Ran. unif. dist. (N=100k, Nd=10000) | 16 | 1
| 0
Ran. unif. dist. (N=100k, Nd=200) | 4 | 4
| 1
Ran. unif. dist. (N=100k, Nd=2000) | 25 | 12
| 1
Ran. unif. dist. (N=100k, Nd=500) | 7 | 7
| 1
Ran. unif. dist. (N=100k, Nd=5000) | 25 | 6
| 1
Ran. unif. dist. (N=30k, Nd=150) | 100 | 100
| 100
Ran. unif. dist. (N=60k, Nd=1000) | 14 | 7
| 1
Ran. unif. dist. (N=60k, Nd=10000) | 0 | 0
| 0
Ran. unif. dist. (N=60k, Nd=200) | 5 | 4
| 1
Ran. unif. dist. (N=60k, Nd=2000) | 16 | 5
| 1
Ran. unif. dist. (N=60k, Nd=500) | 9 | 6
| 1
Ran. unif. dist. (N=60k, Nd=5000) | 14 | 1
| 0
(38 rows)

-John Naylor

Attachment Content-Type Size
MCV_queries_truncated_mean.sql application/sql 3.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeremy Finzel 2018-03-19 17:10:36 found xmin from before relfrozenxid on pg_catalog.pg_authid
Previous Message Tom Lane 2018-03-19 16:47:52 Re: Error detail/hint style fixup