Re: PATCH: adaptive ndistinct estimator v3 (WAS: Re: [PERFORM] Yet another abort-early plan disaster on 9.3)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: adaptive ndistinct estimator v3 (WAS: Re: [PERFORM] Yet another abort-early plan disaster on 9.3)
Date: 2015-02-19 03:08:33
Message-ID: 54E553B1.6030601@2ndquadrant.com
Lists: pgsql-hackers pgsql-performance

On 23.12.2014 11:28, Heikki Linnakangas wrote:
> On 12/07/2014 03:54 AM, Tomas Vondra wrote:
>> The one interesting case is the 'step skew' with statistics_target=10,
>> i.e. estimates based on mere 3000 rows. In that case, the adaptive
>> estimator significantly overestimates:
>>
>>     values   current    adaptive
>>     -------------------------------
>>        106        99         107
>>        106         8     6449190
>>       1006        38     6449190
>>      10006       327       42441
>>
>> I don't know why I didn't get these errors in the previous runs, because
>> when I repeat the tests with the old patches I get similar results with
>> a 'good' result from time to time. Apparently I had a lucky day back
>> then :-/
>>
>> I've been messing with the code for a few hours, and I haven't
>> found any significant error in the implementation, so it seems that
>> the estimator does not perform terribly well for very small samples
>> (in this case it's 3000 rows out of 10.000.000, i.e. ~0.03%).
>
> The paper [1] gives an equation for an upper bound of the error of this
> GEE estimator. How do the above numbers compare with that bound?

Well, that's a bit more complicated, because the "Theorem 1" you mention
does not directly specify an upper bound for a single estimate. It's
formulated like this:
Assume a table with "N" rows, "D" distinct values and a sample of "r"
rows (all those values are fixed). Then for any estimator D' (computed
only from such a sample) there exists a dataset with those parameters
such that the "ratio error"

error(D, D') = max(D'/D, D/D')

is greater than f(N, r, P) with probability at least "P". I.e. if you
randomly choose a sample of 'r' rows, you'll get an error exceeding
that boundary with probability at least P.

So it's not a hard limit, but a statement about the probability of the
estimation error on some (adversarially chosen) dataset. It describes
what you can achieve at best - if you think your estimator is better,
there'll always be a dataset hiding in the shadows hissing "Theorem 1".

Let's say we're looking for a boundary that's crossed only in 1% (or 5%)
of measurements. Applying this to the sample data I posted before, i.e.
10M rows with three sample sizes 'r' (3.000, 30.000 and 300.000 rows),
the ratio error boundary per the paper is

             3.000    30.000    300.000
    ----------------------------------------
     1%         88        28          9
     5%         70        22          7

At least that's what I get if I compute it using this Python function:

    from math import sqrt, log

    def err(N, r, p):
        # f(N, r, P) = sqrt((N-r)/(2r) * ln(1/P))
        return sqrt((N - r) / (2.0 * r) * log(1.0 / p))
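
For example, plugging in the 10M rows and the three sample sizes used
above gives values matching the table (give or take rounding):

    N = 10 * 1000 * 1000

    for r in (3000, 30000, 300000):
        print(r, err(N, r, 0.01), err(N, r, 0.05))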

So the estimates I posted before are not terribly good, I guess,
especially the ones returning 6449190. I wonder whether there really is
some stupid bug in the implementation.
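
To put that in numbers - the worst case above (adaptive estimate of
6449190 for 1006 distinct values) means a ratio error of

    max(6449190.0 / 1006, 1006 / 6449190.0)   # ~6411

which is way above the ~88 boundary (1%) for the 3.000-row sample.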

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
