Re: PATCH: adaptive ndistinct estimator v4

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-04-04 23:57:46
Message-ID: 55207A7A.1050802@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

Hi,

On 04/03/15 15:46, Greg Stark wrote:
> > The simple workaround for this was adding a fallback to GEE when f[1]
> or f[2] is 0. GEE is another estimator described in the paper, behaving
> much better in those cases.
>
> For completeness, what's the downside in just always using GEE?

That's a good question.

GEE is the estimator with minimal average error, as defined in Theorem 1
in that paper. The exact formulation of the theorem is a bit complex,
but it essentially says that knowing just the sizes of the data set and
sample, there's an accuracy limit.

Or put another way, it's possible to construct the data set so that the
estimator gives you estimates with error exceeding some limit (with a
certain probability).

Knowledge of how much the data set is skewed gives us opportunity to
improve the estimates by choosing an estimator performing better with
such data sets. The problem is we don't know the skew - we can only
estimate it from the sample, which is what the hybrid estimators do.

The AE estimator (presented in the paper and implemented in the patch)
is an example of such hybrid estimators, but based on my experiments it
does not work terribly well with one particular type of skew that I'd
expect to be relatively common (few very common values, many very rare
values).

Luckily, GEE performs pretty well in this case, but we can use the AE
otherwise (ISTM it gives really good estimates).

But of course - there'll always be data sets that are poorly estimated
(pretty much as Theorem 1 in the paper says). I'd be nice to do more
testing on real-world data sets, to see if this performs better or worse
than our current estimator.

regards
Tomas

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2015-04-05 00:02:58 Re: PATCH: nbklno in _hash_splitbucket unused (after ed9cc2b)
Previous Message Petr Jelinek 2015-04-04 23:26:29 Re: PATCH: nbklno in _hash_splitbucket unused (after ed9cc2b)

Browse pgsql-performance by date

  From Date Subject
Next Message Josh Berkus 2015-04-06 20:51:43 Re: Some performance testing?
Previous Message Mel Llaguno 2015-04-03 17:44:30 Re: Some performance testing?