Re: MCV lists for highly skewed distributions

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: John Naylor <jcnaylor(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MCV lists for highly skewed distributions
Date: 2018-03-05 14:24:47
Message-ID: CAEZATCUEmHCZeOHJN8JO5O9LK_VuFeCecy_AxTk7S_2SmLXeyw@mail.gmail.com
Lists: pgsql-hackers

On 7 February 2018 at 15:58, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
> On 7 February 2018 at 15:25, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Do you plan to press forward with this, then, or what's
>> the next step?
>
> I plan to do more testing

TL;DR: I tested this new algorithm using 9 different relative standard
error cutoffs (10%, 15%, ... 50%) against a variety of different data
distributions, and it looks like a definite improvement over the
current algorithm, for a wide range of cutoff values. Overall, it
isn't that sensitive to the exact cutoff chosen, but it looks like a
value of 20% is a good general-purpose choice.

One of the properties of the new algorithm is that the size of the MCV
list responds better to changing the stats target, so I don't think
it's worth having a separate knob to control the cutoff percentage.
Such a knob would have a similar but weaker effect than the stats
target knob, and so doesn't seem worthwhile.

Attached is an updated patch. I decided to move the code that
determines the minimum number of occurrences for an MCV list item out
to a separate function. It's not much code, but there's a large
comment to explain its statistical justification, and I didn't want to
duplicate that in 2 different places (possibly to become 3, with the
multivariate MCV stats patch).
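
For anyone who doesn't want to dig through the patch, the reasoning in
that comment is, in outline, as follows (a sketch, glossing over
details that the code comment spells out). For a value with frequency
p in a table of N rows, its count K in a random sample of n rows
follows a hypergeometric distribution, so

    E[K] = n*p,    Var(K) = n*p*(1-p)*(N-n)/(N-1)

Estimating p by K/n, the relative standard error of the count is
sqrt(Var(K))/E[K], and requiring that to be at most a cutoff epsilon
gives a minimum count of

    K >= n*(N-n) / ((N-n) + epsilon^2*n*(N-1))

which, for N much larger than n, comes out just under 1/epsilon^2,
i.e. around 25 sample occurrences with the 20% cutoff.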

I think this is ready for commit now, so barring objections, I will do
so in the next day or so.

---

I tested using the attached python script, which tests a large number
of different data distributions, comparing the results with the patch
against those from HEAD. It uses EXPLAIN ANALYSE to compare estimated
against actual row counts, both for values in the MCV list and for
non-MCV values, making it possible to tell whether it would have been
better if the MCV list had been bigger or smaller.
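
Each per-value check boils down to something of this general form
(illustrative table and value names; the script compares the estimated
and actual row counts reported in the resulting plan):

EXPLAIN ANALYSE SELECT * FROM test_table WHERE x = 42;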

There's a lot of random variation between tests, but by running a lot
of them, the general pattern can be seen. I ran the test 9 times, with
different MCV cutoff values in the patched code of 10%, 15%, ... 50%,
then collected all the results from the test runs into a single CSV
file to analyse using SQL. The columns in the CSV are:

test_name - A textual description of the data distribution used in the test.

mcv_cutoff - The relative standard error cutoff percentage used (10,
15, 20, ... 50), or -10, -15, ... -50 for the corresponding tests
against HEAD.

stats_tgt - The stats target used (100 or 1000).

num_mcvs - The size of the resulting MCV list.

avg_mcv_err - The average percentage difference between estimated
and actual row counts for values in the MCV list. (There's a bit of a
fudge factor in calculating these percentages, to avoid them becoming
too large in cases where the row counts are small, but I think it's
still useful for comparison purposes.)

max_mcv_err - The maximum percentage difference between estimated
and actual row counts for values in the MCV list.

avg_non_mcv_err - The average percentage difference between
estimated and actual row counts for a bunch of non-MCV values.

max_non_mcv_err - The maximum percentage difference between
estimated and actual row counts for a bunch of non-MCV values.

avg_err - The average percentage difference between estimated and
actual row counts for all values tested.

max_err - The maximum percentage difference between estimated and
actual row counts for all values tested.
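
For anyone wanting to re-run the analysis, a table along these lines
is enough to load the CSV into (illustrative DDL, with the columns as
described above; 'results.csv' stands for the combined file):

CREATE TABLE results (
    test_name       text,
    mcv_cutoff      integer,
    stats_tgt       integer,
    num_mcvs        integer,
    avg_mcv_err     float8,
    max_mcv_err     float8,
    avg_non_mcv_err float8,
    max_non_mcv_err float8,
    avg_err         float8,
    max_err         float8
);

\copy results FROM 'results.csv' WITH (FORMAT csv)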

From this, it's possible to look for cases where the MCV list is too
large or too small. For example, looking for too-many-mcv cases
(generally the more uniform data distributions):

SELECT mcv_cutoff, count(*)
FROM results
WHERE max_mcv_err - max_non_mcv_err > 50
GROUP BY 1
ORDER BY 2 DESC;

mcv_cutoff | count
------------+-------
-40 | 41
-50 | 40
-25 | 39
-15 | 38
-20 | 38
-30 | 38
-35 | 38
-45 | 37
-10 | 37
50 | 35
40 | 35
45 | 34
35 | 33
30 | 33
25 | 25
20 | 13
15 | 6
10 | 3
(18 rows)

SELECT mcv_cutoff, count(*)
FROM results
WHERE max_mcv_err - max_non_mcv_err > 100
GROUP BY 1
ORDER BY 2 DESC;

mcv_cutoff | count
------------+-------
-30 | 17
-40 | 16
-15 | 15
-25 | 14
-10 | 14
-35 | 14
-45 | 13
-50 | 13
-20 | 12
35 | 12
45 | 11
40 | 10
30 | 10
50 | 10
25 | 6
10 | 2
(16 rows)
(and implicitly:
15 | 0
20 | 0)

So the patched code is generally better at avoiding the too-many-mcvs
problem, but an mcv_cutoff of 30% or more may be too high.
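
To see which distributions those remaining cases come from, the same
condition can be grouped by test name instead, restricted to the
patched results (I won't reproduce the output here):

SELECT test_name, count(*)
FROM results
WHERE max_mcv_err - max_non_mcv_err > 50
AND mcv_cutoff > 0
GROUP BY 1
ORDER BY 2 DESC;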

Looking for too-few-mcv cases:

SELECT mcv_cutoff, count(*)
FROM results
WHERE (max_non_mcv_err - max_mcv_err > 50 AND num_mcvs < stats_tgt)
GROUP BY 1
ORDER BY 2 DESC;

mcv_cutoff | count
------------+-------
-20 | 120
-50 | 116
-30 | 115
-35 | 114
-25 | 114
-10 | 114
-45 | 113
10 | 113
-40 | 113
-15 | 111
15 | 98
20 | 49
25 | 44
40 | 39
30 | 38
45 | 37
35 | 37
50 | 35
(18 rows)

SELECT mcv_cutoff, count(*)
FROM results
WHERE (max_non_mcv_err - max_mcv_err > 100 AND num_mcvs < stats_tgt)
GROUP BY 1
ORDER BY 2 DESC;

mcv_cutoff | count
------------+-------
-20 | 119
-50 | 116
-30 | 115
-40 | 113
-45 | 112
-25 | 112
-10 | 112
-35 | 112
-15 | 111
10 | 106
15 | 51
20 | 45
25 | 43
30 | 38
40 | 37
35 | 36
45 | 34
50 | 31
(18 rows)

and looking specifically at the extent to which the problem persists
when the stats target is increased to 1000:

SELECT mcv_cutoff, count(*)
FROM results
WHERE max_non_mcv_err - max_mcv_err > 100
AND stats_tgt = 1000
GROUP BY 1
ORDER BY 2 DESC;

mcv_cutoff | count
------------+-------
-20 | 69
-30 | 67
-50 | 67
-35 | 66
-40 | 66
-10 | 66
-15 | 65
-25 | 64
-45 | 63
10 | 60
30 | 8
20 | 8
45 | 8
15 | 8
35 | 7
50 | 7
25 | 7
40 | 6
(18 rows)

So again, the patched code is generally better at avoiding the
too-few-mcvs problem, but an mcv_cutoff of 10% is almost certainly too
low.

Looking at the effect of changing the stats target from 100 to 1000 in
the tests:

SELECT r1.mcv_cutoff, sum(r2.num_mcvs - r1.num_mcvs)
FROM results r1, results r2
WHERE r2.test_name = r1.test_name
AND r2.mcv_cutoff = r1.mcv_cutoff
AND r1.stats_tgt = 100
AND r2.stats_tgt = 1000
GROUP BY 1
ORDER BY 2 DESC;

mcv_cutoff | sum
------------+-------
15 | 88582
20 | 87124
35 | 86828
10 | 86749
45 | 86632
40 | 86552
50 | 86384
25 | 85567
30 | 85352
-25 | 28596
-15 | 28567
-35 | 28559
-40 | 28517
-20 | 28509
-30 | 28483
-45 | 28464
-50 | 28452
-10 | 28411
(18 rows)

it's clear that the patched code is much better at responding to
increasing the stats target and producing more MCVs.

Interestingly, while eyeballing the raw test output I noticed that,
with HEAD, there are quite a few instances where increasing the stats
target actually decreases the size of the MCV list, often in cases
where the data is distributed less uniformly and the MCV list is too
small to start with:

SELECT r1.mcv_cutoff, sum(r2.num_mcvs - r1.num_mcvs)
FROM results r1, results r2
WHERE r2.test_name = r1.test_name
AND r2.mcv_cutoff = r1.mcv_cutoff
AND r1.stats_tgt = 100
AND r2.stats_tgt = 1000
AND r2.num_mcvs < r1.num_mcvs
GROUP BY 1
ORDER BY 2 DESC;

mcv_cutoff | sum
------------+-------
15 | -4
10 | -11
-35 | -5475
-20 | -5493
-50 | -5507
-40 | -5510
-45 | -5510
-30 | -5511
-25 | -5514
-10 | -5528
-15 | -5532
(11 rows)
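
The individual regressions can be pulled out with a variant of the
same query, e.g.:

SELECT r1.test_name, r1.mcv_cutoff,
       r1.num_mcvs AS num_mcvs_100, r2.num_mcvs AS num_mcvs_1000
FROM results r1, results r2
WHERE r2.test_name = r1.test_name
AND r2.mcv_cutoff = r1.mcv_cutoff
AND r1.stats_tgt = 100
AND r2.stats_tgt = 1000
AND r2.num_mcvs < r1.num_mcvs
ORDER BY r1.num_mcvs - r2.num_mcvs DESC;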

I could probably keep going, querying the test result data in all
sorts of other ways (and it's attached, should anyone wish to do so,
although bear in mind that it's subject to quite large random
variations), but my general conclusions are:

- The patched code is better than HEAD at avoiding the too-many-mcvs
problem, unless the mcv_cutoff is set too high (maybe around 30% or
more).

- The patched code is much better at avoiding the too-few-mcvs
problem, unless the mcv_cutoff is set too low (10% seems to be too
low).

- The patched code is much better at producing more MCVs as the stats
target is increased, making it better able to handle more non-uniform
distributions.

- An mcv_cutoff of 20% looks like a good general-purpose value.

Regards,
Dean

Attachment          Content-Type        Size
mcv-tests.tar.bz2   application/x-bzip  66.9 KB
mcv-stats.patch     text/x-patch        8.4 KB
