Re: MCV lists for highly skewed distributions

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: John Naylor <jcnaylor(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MCV lists for highly skewed distributions
Date: 2018-02-07 08:51:05
Message-ID: CAEZATCXmzeHM8h63KM4HOO40XvYVw8jxkU0eBgMRBeaUKPjXfw@mail.gmail.com
Lists: pgsql-hackers

On 4 February 2018 at 12:18, John Naylor <jcnaylor(at)gmail(dot)com> wrote:
> I did the same basic eyeball testing I did on earlier patches, and
> this is the best implementation so far. I've attached some typical
> pg_stats contents for HEAD and this patch. More rigorous testing,
> including of planner estimates on real data, is still needed of
> course, but I think this is definitely in the right direction.
>

Thanks for testing. I agree, this new algorithm seems to stand up
pretty well in the testing I've done too. One thing about it that can
be tuned is the cutoff threshold for the relative standard error -- I
chose 10% completely arbitrarily, but it could just as easily be 20%,
30% or even higher (the constant in the patch is the squared cutoff, so
s/0.01/0.04/ or s/0.01/0.09/ respectively).

Looking at the too-many-MCVs example from [1], the table has 100
million rows, and each distinct value occurs 10 times. With the
default stats target of 100, the MCV-list is fully populated with
values, most having been seen two (or occasionally three) times. Thus,
if you query for one of those MCVs, the estimate is 6667 or 10,000
rather than 10, which is a pretty bad over-estimate. Increasing the
stats target improves things, because the sample frequencies go down
correspondingly. The problem is also lessened a bit if you randomise
the order of the values in the second column so that it isn't
correlated to the storage order:

insert into qqq select a, random()*10^7 from generate_series(1, (10^8)::int) a;

However in a sample of 30,000 rows it remains reasonably likely that
the same value will be seen twice (I typically observed 40 to 50 MCVs
in this case, cf. the birthday paradox), and in a sample of 300,000
rows it goes back to giving 1000 MCVs, each seen 2 or 3 times. So this
isn't really the fault of the non-uniform ANALYSE sample, but rather
of the current algorithm's belief that seeing a value just twice is
sufficient for it to qualify as an MCV.
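
To put rough numbers on that (back-of-the-envelope arithmetic only,
nothing taken from the patch):

-- With ~10^7 distinct values, the expected number of same-value pairs
-- among 30,000 sampled rows is roughly C(30000,2) / 10^7 ~ 45, which is
-- consistent with the 40-50 MCVs observed, and a value seen 2 or 3 times
-- in the sample is scaled up to ~6667 or 10000 matching rows in the table.
select round(30000.0 * 29999 / 2 / 1e7) as expected_repeated_values,
       round(2 * 1e8 / 30000)           as estimate_if_seen_twice,
       round(3 * 1e8 / 30000)           as estimate_if_seen_three_times;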

The new MCV algorithm solves this by demanding that a value be seen
many more times before being included in cases where the table size is
much larger than the sample size. The exact threshold depends on the
table size, the sample size and the relative error cutoff value. In
this example, with a table size of 100 million rows, the sample size
makes little difference, because it's always going to be much smaller
than the table size, so the number of instances required in the sample
depends mostly on the RSE cutoff chosen:

 rse_cutoff | sample=30000 | sample=300000 | sample=3000000
------------+--------------+---------------+----------------
 10%        |          100 |           100 |             97
 20%        |           25 |            25 |             25
 30%        |           12 |            12 |             11
 40%        |            7 |             7 |              7
 50%        |            4 |             4 |              4

So any of those RSE cutoffs solves the too-many-MCVs problem in this
particular case, giving no MCVs, although 50% is pushing it a bit, and
in any case, the theoretical basis for this algorithm breaks down well
before then.
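
For reference, the per-sample-size thresholds in that table can be
reproduced with a quick query -- this is just the small-frequency
approximation (1 - n/N) / rse_cutoff^2, not the exact patch code:

-- Approximate minimum number of times a value must be seen in a sample
-- of n rows from a table of N = 10^8 rows for its relative standard
-- error to stay below the cutoff (back-of-the-envelope only).
select c.rse_cutoff,
       s.n as sample_size,
       ceil((1 - s.n / 1e8) / (c.rse_cutoff * c.rse_cutoff)) as min_count
from (values (0.1), (0.2), (0.3), (0.4), (0.5)) as c(rse_cutoff),
     (values (30000), (300000), (3000000)) as s(n)
order by c.rse_cutoff, s.n;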

The advantage of a larger RSE cutoff is that it gives more MCVs for a
non-uniform data distribution, where the current algorithm tends to
give too few MCVs. In the attached example, the table has 1 million
rows of integer data in a normal distribution with a standard
deviation of 50. This typically gives somewhere in the region of 430
to 440 distinct values. Running against HEAD and with this patch, for
varying sample sizes and RSE cutoffs gives the following MCV-list
sizes:

stats_target  mcvs (HEAD)  mcvs (10% RSE)  mcvs (20% RSE)  mcvs (30% RSE)
          10           10               0              10              10
          20           20               0              20              20
          30           30               0              30              30
          40           40              17              40              40
          50           50              50              50              50
         100          100             100             100             100
         300          132             204             261             290
        1000          205             266             315             338
        3000          252             364             400             411
       10000          296             413             414             412

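The actual test table comes from the attached script; purely as a
sketch (not the attached script itself), a table with a similar
distribution could be generated with a Box-Muller transform along
these lines:

-- Sketch only: ~1 million integers from a normal distribution with mean 0
-- and standard deviation 50; ln(1 - random()) avoids taking ln(0).
create table norm_test (x int);

insert into norm_test
select (50 * sqrt(-2 * ln(1 - random())) * cos(2 * pi() * random()))::int
from generate_series(1, 1000000);

analyze norm_test;
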
One thing to note is that the HEAD algorithm never includes more than
around 300 values, because of its requirement for a value to be more
common than average. The new algorithm has no such requirement, so it
will include nearly every value in the MCV list (except the most
uncommon ones) if the stats target is set high enough. Also, the
detailed output from the test shows that the resulting estimates based
on those MCVs are pretty decent.

(Note: this example shows that the too-few-MCVs problem can occur in
any non-uniform distribution, not just highly skewed ones.)

I've also tried this out against some real-world data, and in the
testing I've done so far, the new algorithm is not actually that
sensitive to the precise RSE cutoff chosen, but I'm leaning towards a
value of 20% because there are cases where 10% is a little too strict.

Regards,
Dean

[1] https://www.postgresql.org/message-id/flat/20170522132017(dot)29944(dot)48391(at)wrigleys(dot)postgresql(dot)org

Attachment Content-Type Size
test_normal_0_50.sql application/octet-stream 1.5 KB
