Quick Links

Re: MCV lists for highly skewed distributions

From:	John Naylor <jcnaylor(at)gmail(dot)com>
To:	Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: MCV lists for highly skewed distributions
Date:	2018-03-19 16:59:17
Message-ID:	CAJVSVGUOw4DWVjTK1eb5jOiYVQK8s9m8MejqpdPoRDmwk9icUg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 3/19/18, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
> As promised, here is a new patch, with comment updates, per John and
> Tomas' suggestions, plus the continuity correction, which seemed just
> about worthwhile.

Great. I'm happy with the behavior of the patch. I've marked it ready
for committer.

> I repeated these tests with a 2-SD continuity-corrected threshold and
> the results fell somewhere between the 2-SD and 2.5-SD cases. My
> inclination is to go with that, as something that strikes a reasonable
> balance, and represents a distinct improvement over master in a number
> of different areas.

I also ran some tests with a hand-edited continuity correction last
night, but just now got around to looking at the results (queries
attached). I ran largely as before, but did 20 analyze runs with each
file instead of 10, using the V2 patch, the CC patch, and the V2 patch
with 2.5 threshold. I took out a bunch of the uniform tests to save
time since they largely behave the same.

-----------------
Non-uniform:

To reduce noise for analysis, I tried filtering out the least and
greatest distinct value before taking the average for the
underestimation ratio for most common non-MCV. I also removed results
where all 3 patches had a full MCV list. There was still enough noise
that it's impossible to discern differences between patches that are
very similar. It's not like against HEAD where there were clear
differences in this test, so I won't post those numbers.

Looking at the number of MCVs, it's a little clearer. With few
exceptions, the number either stays the same or decreases slightly
going left to right:

title | v20_num_mcvs |
cc05_num_mcvs | v25_num_mcvs
------------------------------------------+--------------+---------------+--------------
Exp. dist. (N=500k, sd=0.25 (beta)) | 3 |
3 | 3
Exp. dist. (N=500k, sd=0.50 (beta)) | 5 |
6 | 5
Exp. dist. (N=500k, sd=1.00 (beta)) | 9 |
9 | 9
Exp. dist. (N=500k, sd=2.00 (beta)) | 15 |
15 | 15
Exp. dist. (N=500k, sd=3.00 (beta)) | 22 |
21 | 21
Exp. dist. (N=500k, sd=4.00 (beta)) | 27 |
27 | 26
Exp. dist. (N=500k, sd=5.00 (beta)) | 33 |
32 | 30
Exp. dist. (N=500k, sd=10.00 (beta)) | 60 |
58 | 56
Exp. dist. (N=500k, sd=20.00 (beta)) | 100 |
99 | 97
Laplace dist. (N=500k, b=0.25 (lambda)) | 5 |
6 | 5
Laplace dist. (N=500k, b=0.50 (lambda)) | 9 |
8 | 7
Laplace dist. (N=500k, b=1.00 (lambda)) | 15 |
15 | 15
Laplace dist. (N=500k, b=2.00 (lambda)) | 27 |
26 | 26
Laplace dist. (N=500k, b=3.00 (lambda)) | 38 |
37 | 36
Laplace dist. (N=500k, b=4.00 (lambda)) | 48 |
47 | 45
Laplace dist. (N=500k, b=5.00 (lambda)) | 58 |
57 | 55
Laplace dist. (N=500k, b=10.00 (lambda)) | 100 |
99 | 97
MNM (N=2000, peaks=20, sep=20, sd=15) | 100 |
100 | 62
Normal dist. (N=500k, sd=1) | 7 |
7 | 7
Normal dist. (N=500k, sd=2) | 15 |
15 | 14
Normal dist. (N=500k, sd=3) | 21 |
21 | 20
Normal dist. (N=500k, sd=4) | 28 |
26 | 26
Normal dist. (N=500k, sd=5) | 33 |
33 | 33
Normal dist. (N=500k, sd=6) | 39 |
38 | 38
Normal dist. (N=500k, sd=7) | 45 |
45 | 44
Normal dist. (N=500k, sd=8) | 51 |
49 | 49
Normal dist. (N=500k, sd=9) | 56 |
56 | 54
Normal dist. (N=500k, sd=10) | 63 |
62 | 60
Normal dist. (N=500k, sd=15) | 89 |
89 | 86
Pareto dist. (N=500, a=1.00) | 70 |
65 | 56
Pareto dist. (N=500, a=2.00) | 20 |
19 | 17
Pareto dist. (N=500, a=3.00) | 10 |
9 | 9
Pareto dist. (N=500, a=4.00) | 6 |
6 | 6
Pareto dist. (N=500, a=5.00) | 5 |
5 | 4
(34 rows)

-----------------
Uniform:

Ideally, we want an empty MCV list unless we're sampling a large
portion of the table. As Dean mentioned, this is not as important as
getting the non-uniform case right. That said, it's nice to be
confident we can keep out flotsam. The number generally gets smaller
as we go left to right.

title | v20_num_mcvs | cc05_num_mcvs
| v25_num_mcvs
---------------------------------------+--------------+---------------+--------------
Corr. unif. dist. (N=1000k, Nd=1000) | 13 | 9
| 2
Corr. unif. dist. (N=1000k, Nd=10000) | 33 | 8
| 0
Corr. unif. dist. (N=1000k, Nd=200) | 4 | 3
| 1
Corr. unif. dist. (N=1000k, Nd=2000) | 21 | 10
| 1
Corr. unif. dist. (N=1000k, Nd=500) | 9 | 8
| 1
Corr. unif. dist. (N=1000k, Nd=5000) | 15 | 15
| 2
Corr. unif. dist. (N=100k, Nd=1000) | 12 | 7
| 3
Corr. unif. dist. (N=100k, Nd=10000) | 15 | 2
| 0
Corr. unif. dist. (N=100k, Nd=200) | 4 | 3
| 1
Corr. unif. dist. (N=100k, Nd=2000) | 24 | 11
| 2
Corr. unif. dist. (N=100k, Nd=500) | 6 | 7
| 2
Corr. unif. dist. (N=100k, Nd=5000) | 25 | 6
| 1
Corr. unif. dist. (N=30k, Nd=150) | 100 | 100
| 100
Corr. unif. dist. (N=60k, Nd=1000) | 14 | 6
| 1
Corr. unif. dist. (N=60k, Nd=10000) | 0 | 0
| 0
Corr. unif. dist. (N=60k, Nd=200) | 4 | 4
| 1
Corr. unif. dist. (N=60k, Nd=2000) | 16 | 5
| 1
Corr. unif. dist. (N=60k, Nd=500) | 9 | 5
| 1
Corr. unif. dist. (N=60k, Nd=5000) | 16 | 1
| 0
Ran. unif. dist. (N=1000k, Nd=1000) | 16 | 9
| 2
Ran. unif. dist. (N=1000k, Nd=10000) | 33 | 9
| 1
Ran. unif. dist. (N=1000k, Nd=200) | 4 | 4
| 1
Ran. unif. dist. (N=1000k, Nd=2000) | 19 | 11
| 2
Ran. unif. dist. (N=1000k, Nd=500) | 9 | 7
| 1
Ran. unif. dist. (N=1000k, Nd=5000) | 15 | 15
| 2
Ran. unif. dist. (N=100k, Nd=1000) | 12 | 8
| 3
Ran. unif. dist. (N=100k, Nd=10000) | 16 | 1
| 0
Ran. unif. dist. (N=100k, Nd=200) | 4 | 4
| 1
Ran. unif. dist. (N=100k, Nd=2000) | 25 | 12
| 1
Ran. unif. dist. (N=100k, Nd=500) | 7 | 7
| 1
Ran. unif. dist. (N=100k, Nd=5000) | 25 | 6
| 1
Ran. unif. dist. (N=30k, Nd=150) | 100 | 100
| 100
Ran. unif. dist. (N=60k, Nd=1000) | 14 | 7
| 1
Ran. unif. dist. (N=60k, Nd=10000) | 0 | 0
| 0
Ran. unif. dist. (N=60k, Nd=200) | 5 | 4
| 1
Ran. unif. dist. (N=60k, Nd=2000) | 16 | 5
| 1
Ran. unif. dist. (N=60k, Nd=500) | 9 | 6
| 1
Ran. unif. dist. (N=60k, Nd=5000) | 14 | 1
| 0
(38 rows)

-John Naylor

Attachment	Content-Type	Size
MCV_queries_truncated_mean.sql	application/sql	3.1 KB

In response to

Re: MCV lists for highly skewed distributions at 2018-03-19 08:48:44 from Dean Rasheed

Responses

Re: MCV lists for highly skewed distributions at 2018-03-22 09:56:11 from Dean Rasheed

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Jeremy Finzel	2018-03-19 17:10:36	found xmin from before relfrozenxid on pg_catalog.pg_authid
Previous Message	Tom Lane	2018-03-19 16:47:52	Re: Error detail/hint style fixup