Quick Links

Re: Estimating number of distinct values.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Estimating number of distinct values.
Date:	2018-10-24 14:27:19
Message-ID:	19857.1540391239@sss.pgh.pa.us
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> writes:
> I will be pleased if somebody (first of all Robert) can comment me
> "strange" results of distinct values estimation.

Estimating the number of distinct values from a small sample is a hard
problem; every algorithm is going to blow it in some cases.

> In my case there are no null values:
> totalrows = 48980014
> samplerows = 30000
> ndistinct = 29135
> nmultiple = 800

You seem to be using the default statistics target. Possibly raising that
would give better answers for this column.

> May be instead of sampling-based estimation use streaming
> based estimation, for example HyperLogLog algorithm already present in
> Postgres?

Maybe, but I'd be really surprised if HLL were any sort of magic bullet.
I think it's intended to make an ndistinct estimate with just a small
amount of state preserved as rows are scanned, which is an admirable
goal but not one that ANALYZE particularly needs. Adopting such a
constraint when we don't need it does not sound like a recipe for
getting a better final answer.

regards, tom lane

In response to

Estimating number of distinct values. at 2018-10-24 14:06:57 from Konstantin Knizhnik

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	David G. Johnston	2018-10-24 14:40:00	Re: JSON validation behavior
Previous Message	Sergei Kornilov	2018-10-24 14:24:01	JSON validation behavior