Re: Yet another abort-early plan disaster on 9.3

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-10 17:53:36
Message-ID: 54381D20.9010501@fuzzy.cz
Lists: pgsql-hackers pgsql-performance

On 10.10.2014 14:10, Tomas Vondra wrote:
> On 10 October 2014, 13:16, Greg Stark wrote:
>> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>>> fixed-size sample. If we're allowed to sample, say, 1% of the table, we
>>> can get a MUCH more accurate n_distinct estimate using multiple
>>> algorithms, of which HLL is one. While n_distinct will still have some
>>> variance, it'll be over a much smaller range.
>>
>> I've gone looking for papers on this topic but from what I read this
>> isn't so. To get any noticeable improvement you need to read 10-50% of
>> the table and that's effectively the same as reading the entire table
>> -- and it still had pretty poor results. All the research I could find
>> went into how to analyze the whole table while using a reasonable
>> amount of scratch space and how to do it incrementally.
>
> I think it's really difficult to discuss the estimation without some basic
> agreement on what are the goals. Naturally, we can't get a perfect
> estimator with small samples (especially when the sample size is fixed and
> not scaling with the table). But maybe we can improve the estimates
> without scanning most of the table?
>
> FWIW I've been playing with the adaptive estimator described in [1] and
> the results looks really interesting, IMHO. So far I was testing it on
> synthetic datasets outside the database, but I plan to use it instead of
> our estimator, and do some more tests.

Attached is an experimental patch implementing the adaptive estimator.

It was fairly simple (although it's a bit messy). It only computes the
estimates for the "scalar" case (i.e. data types that we can sort).
Implementing this for the "minimal" case is possible, but requires a bit
more work.

It only computes the estimate and prints a WARNING with both the current
and the new estimate; the old estimate is still the one that gets stored.
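
For reference, the estimate that actually ends up stored can be inspected
the usual way after running ANALYZE on one of the test tables described
below (the column name 'val' is an assumption, the attached SQL may use a
different one):

ANALYZE test;

-- n_distinct as stored by ANALYZE; with the patch the adaptive estimate
-- is only printed as a WARNING and does not end up here
SELECT n_distinct
  FROM pg_stats
 WHERE tablename = 'test' AND attname = 'val';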

I also attach a few examples of synthetic datasets with various
distributions that I used for testing. In all cases there's a single table
with 10M rows and a single INT column. There are two kinds of skew:

1) smooth skew

- N distinct values (100, 10.000 and 100.000 values)
- average moves to 0 as 'k' increases ('k' between 1 and 9)
- smooth distribution of frequencies

INSERT INTO test
SELECT pow(random(),k) * 10000 FROM generate_series(1,10000000);

2) step skew

- a few very frequent values, many rare values
- for example this generates 5 very frequent and ~10k rare values

INSERT INTO test
SELECT (CASE WHEN (v < 90000) THEN MOD(v,5) ELSE v END)
  FROM (
       SELECT (random()*100000)::int AS v
         FROM generate_series(1,10000000)
       ) foo;
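
For example, with k=5 the smooth-skew dataset can be generated like this,
together with the "true" ndistinct used as ground truth below (assuming the
single INT column is called 'val'; the attached ndistinct.sql may name it
differently):

CREATE TABLE test (val int);

-- smooth skew, k = 5 (values concentrate near 0 as k grows)
INSERT INTO test
SELECT pow(random(), 5) * 10000 FROM generate_series(1,10000000);

-- ground truth to compare the estimates against
SELECT count(DISTINCT val) FROM test;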

Results
=======

I tested this with various statistics target settings (10, 100, 1000),
which translates to different sample sizes.
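
The target can be set globally (default_statistics_target) or per column;
ANALYZE then samples 300 * statistics_target rows, which is where the
3k / 30k / 300k row figures below come from. A per-column example (column
name assumed as before):

-- statistics target 1000 => ANALYZE samples 300 * 1000 = 300k rows
ALTER TABLE test ALTER COLUMN val SET STATISTICS 1000;
ANALYZE test;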

statistics target 100 (default, 30k rows, 0.3% sample)
======================================================

a) smooth skew, 101 values, different skew ('k')

    k    current   adaptive
   -------------------------
    1        101        102
    3        101        102
    5        101        102
    7        101        102
    9        101        102

b) smooth skew, 10.001 values, different skew ('k')

    k    current   adaptive
   -------------------------
    1       9986      10542
    3       8902      10883
    5       7579      10824
    7       6639      10188
    9       5947      10013

c) step skew (different numbers of values)

   values    current   adaptive
  ------------------------------
      106        106        107
      106         35        104
     1006        259       1262
    10006       2823      11047

statistics target 10 (3k rows, 0.03% sample)
============================================

a) smooth skew, 101 values, different skew ('k')

    k    current   adaptive
   -------------------------
    1        101        102
    3        101        102
    5        101        102
    7        101        102
    9        101        102

b) smooth skew, 10.001 values, different skew ('k')

    k    current   adaptive
   -------------------------
    1       9846      10014
    3       4399       7190
    5       2532       5477
    7       1938       4932
    9       1623       1623

c) step skew (different numbers of values)

   values    current   adaptive
  ------------------------------
      106        100        114
      106          5          5
     1006         37        532
    10006        323      20970

statistics target 1000 (300k rows, 3% sample)
=============================================

a) smooth skew, 101 values, different skew ('k')

    k    current   adaptive
   -------------------------
    1        101        102
    3        101        102
    5        101        102
    7        101        102
    9        101        102

b) smooth skew, 10.001 values, different skew ('k')

    k    current   adaptive
   -------------------------
    1      10001      10002
    3      10000      10000
    5       9998      10011
    7       9973      10045
    9       9939      10114

c) step skew (different numbers of values)

   values    current   adaptive
  ------------------------------
      106        106        107
      106        101        107
     1006        957       1096
    10006       9551      10550

Summary
=======

I have yet to see an example where the adaptive estimator produces worse
results than the current one, irrespective of the distribution and the
sample size / statistics target.

What really matters is the sample size relative to the table size, so I'll
refer to the samples as 0.03%, 0.3% and 3% instead of using the statistics
targets.

For the large sample (3%) both estimators produce reasonably accurate
results. This won't hold as the tables grow, however, because the number of
rows we sample is fixed (it does not grow with the table).

As the sample shrinks, the adaptive estimator starts winning. For the 0.3%
sample the difference is easily 3x for the high-skew cases. E.g. for one of
the "step skew" distributions the actual ndistinct is 10006; the current
estimator gives 2823 while the adaptive one gives 11047. That's a ratio
error of ~3.5x vs. ~1.1x.
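
(Assuming "ratio error" means the larger of estimate and true value divided
by the smaller, the two numbers can be reproduced like this:)

-- ratio error = max(estimate, true) / min(estimate, true)
SELECT round(greatest(2823, 10006)::numeric  / least(2823, 10006), 2)  AS current_ratio,   -- ~3.54
       round(greatest(11047, 10006)::numeric / least(11047, 10006), 2) AS adaptive_ratio;  -- ~1.10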

For the tiny 0.03% sample the difference gets even more significant,
especially for the step-skew cases, where the improvement is often an
order of magnitude.

Proposal
========

I think the adaptive estimator works very well, and I plan to submit it
to the next commitfest after a bit of polishing. Examples of
distributions that break it are welcome of course.

Also, I think it's clear it would be useful to be able to scale the sample
proportionally to the table (instead of only allowing the current
statistics-target approach). I understand it may result in scanning a large
part of the table, but I don't see a problem with that as long as it's not
the default behavior (and it's clearly documented). I don't see a way
around it - small samples simply result in poor estimates.

I was thinking this could be done through the statistics target, but it's
already used for other things (e.g. the size of the MCV list and the
histogram) and we don't want to break that by giving it yet another
meaning.
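
To illustrate the overloading: the statistics target already caps the sizes
of both the MCV list and the histogram, which is easy to see in pg_stats
(column name assumed as before):

-- both array lengths are bounded by the column's statistics target
SELECT array_length(most_common_vals, 1)  AS mcv_entries,
       array_length(histogram_bounds, 1)  AS hist_bounds
  FROM pg_stats
 WHERE tablename = 'test' AND attname = 'val';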

Ideas?

regards
Tomas

Attachment                  Content-Type      Size
ndistinct-estimator.patch   text/x-diff       4.8 KB
ndistinct.sql               application/sql   3.7 KB
