Re: estimating # of distinct values

From: Florian Pflug <fgp(at)phlo(dot)org>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Tomas Vondra" <tv(at)fuzzy(dot)cz>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: estimating # of distinct values
Date: 2010-12-30 14:43:49
Message-ID: 97F89154-01B4-4F22-ADD3-48D96C84927A@phlo.org
Lists: pgsql-hackers

On Dec27, 2010, at 23:49 , Kevin Grittner wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> With respect to (b), I think I'd need to see a much more detailed
>> design for how you intend to make this work. Off the top of my
>> head there seems to be some pretty serious feasibility problems.
>
> I had one random thought on that -- it seemed like a large concern
> was that there would need to be at least an occasional scan of the
> entire table to rebuild the distinct value information.

I believe we could actually avoid that.

First, the paper "An Optimal Algorithm for the Distinct Elements Problem"
actually contains an algorithm which *does* handle deletes - it's called
the "L_0" estimate there.

Second, as Tomas pointed out, the stream-based estimator is essentially a
simplified version of a bloom filter. It starts out with a field of
N zero bits, and sets K of them to 1 for each value v in the stream.
Which bits are set to 1 depends on some hash function(s) H_i(v). It's
then easy to compute how many 1-bits you'd expect to find in the bit
field after seeing D distinct values, and by reversing that you can
estimate D from the number of 1-bits in the bit field.

To avoid having to rescan large tables, instead of storing one such
bit field, we'd store one per B pages of data. We'd then only need
to scan the range of B pages around every updated or deleted tuple,
and could afterwards compute a new global estimate of D by combining
the individual bit fields with bitwise or.
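Continuing the sketch above (again purely illustrative, using the
dv_sketch/dv_estimate definitions from the previous snippet), the global
estimate would just OR the per-range bit fields together before inverting:

/*
 * One dv_sketch per range of B pages; OR them into a single global
 * bit field and estimate D from that.
 */
static double
dv_estimate_global(const dv_sketch *per_range, int nranges)
{
	dv_sketch	global;

	memset(&global, 0, sizeof(global));
	for (int r = 0; r < nranges; r++)
		for (int i = 0; i < NBITS / 8; i++)
			global.bits[i] |= per_range[r].bits[i];

	return dv_estimate(&global);
}

/*
 * After an UPDATE or DELETE hitting page p, only per_range[p / B] needs
 * to be rebuilt by rescanning that range of B pages; the other ranges'
 * bit fields stay valid, and the global estimate is recomputed cheaply.
 */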

Since the need to regularly VACUUM tables hit by updates or deletes
won't go away any time soon, we could piggy-back the bit field
rebuilding onto VACUUM to avoid a second scan.

A good value for B would probably be around
1000 * <size of bit field> / <page size>. If the bit field needs ~100kB,
that'd make B ~= 12000 pages ~= 100MB of heap per range (assuming 8kB pages).

best regards,
Florian Pflug
