Quick Links

Re: serious under-estimation of n_distinct for clustered distributions

From:	Claudio Freire <klaussfreire(at)gmail(dot)com>
To:	Stefan Andreatta <s(dot)andreatta(at)synedra(dot)com>
Cc:	postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject:	Re: serious under-estimation of n_distinct for clustered distributions
Date:	2012-12-30 19:24:21
Message-ID:	CAGTBQpbjdoKob-7M9oO_q9=OJ6Kt0G8dVK1VoZEpxtxCKLbgdw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

On Sat, Dec 29, 2012 at 5:57 PM, Stefan Andreatta
<s(dot)andreatta(at)synedra(dot)com> wrote:
> n*d / (n - f1 + f1*n/N)
>
> where f1 is the number of values that occurred only once in the sample. n is
> the number of rows sampled, d the number of distincts found and N the total
> number of rows in the table.
>
...
>
> When the number of values that are found only once in the sample (f1)
> becomes zero, the whole term equals d, that is, n_distinct is estimated to
> be just the number of distincts found in the sample.

I think the problem lies in the assumption that if there are no
singly-sampled values, then the sample must have included all distinct
values. This is clearly not true even on a fully random sample, it
only means those sampled distinct values appear frequently enough to
be excluded from the sample.

In the clustered case, the error would be evident if the
randomly-sampled pages were split into two samples, and considered
separately. The distinct set in one would not match the distinct set
in the other, and the intersection's size would say something about
the real number of distinct values in the whole population.

In response to

serious under-estimation of n_distinct for clustered distributions at 2012-12-29 20:57:04 from Stefan Andreatta

Browse pgsql-performance by date

	From	Date	Subject
Next Message	John Kasarda	2012-12-31 02:03:06	Slow connections on Win 7
Previous Message	Stefan Andreatta	2012-12-30 18:02:44	Re: serious under-estimation of n_distinct for clustered distributions