From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: estimating # of distinct values
Date: 2011-01-15 19:26:19
Message-ID: 4D31F4DB.9050305@fuzzy.cz

Hi,

a short update regarding the ndistinct stream estimators - I've
implemented the estimators described in the papers mentioned in my
previous posts. If you want to try it, the sources are available at
github at http://tvondra.github.com/pg_estimator (ignore the page
itself, I still have to update it - just download the sources).

You can build the estimators as standalone benchmarks (build.sh) or as
aggregate functions (build-agg.sh) - I guess the latter is much easier
to play with. The installation is described on my blog, along with two
simple benchmarks:

http://www.fuzzy.cz/en/articles/aggregate-functions-for-distinct-estimation/

Feel free to play with it on different data sets, and if you notice
anything strange, just let me know.

Several comments regarding the implemented estimators:

1) PCSA, Probabilistic Counting (both described in the 1985 paper)

- quite good performance, especially with respect to the memory
requirements (95 B and 495 B)
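Just to illustrate the basic idea (a minimal sketch assuming a
reasonably uniform 32-bit hash, not the pg_estimator code): hash each
value, record the position of the lowest set bit in a bitmap, and derive
the estimate from the lowest bit still unset - PCSA then averages many
such bitmaps to reduce the variance:

  #include <stdint.h>
  #include <math.h>

  #define FM_BITS 32

  typedef struct {
      uint32_t bitmap;  /* bit i set => some hash ended in i zero bits */
  } fm_counter;

  /* position of the least significant 1-bit ("rho" in the paper) */
  static int rho(uint32_t h)
  {
      int i = 0;

      if (h == 0)
          return FM_BITS - 1;
      while ((h & 1) == 0)
      {
          h >>= 1;
          i++;
      }
      return i;
  }

  /* add a (pre-hashed) element */
  void fm_add(fm_counter *c, uint32_t hash)
  {
      c->bitmap |= ((uint32_t) 1 << rho(hash));
  }

  /* ndistinct ~ 2^R / 0.77351, R = position of the lowest unset bit */
  double fm_estimate(const fm_counter *c)
  {
      int R = 0;

      while (R < FM_BITS && (c->bitmap & ((uint32_t) 1 << R)))
          R++;
      return pow(2.0, R) / 0.77351;
  }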

2) Adaptive Sampling (1990), Self-Learning Bitmap (2009)

- very different algorithms, yet almost the same performance (especially
with a large number of distinct values)

- my personal favourites, as the memory requirements are still very
reasonable (compared to the Bloom Counter)
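The Adaptive Sampling idea fits in a few lines too - a sketch assuming a
256-entry buffer and a linear-scan lookup (a real implementation would
use a hash table): keep only hashes whose lowest 'depth' bits are zero,
raise the depth whenever the buffer overflows (which roughly halves the
sample), and multiply back at the end:

  #include <stdint.h>

  #define AS_CAPACITY 256    /* number of hashes kept (an assumption) */

  typedef struct {
      uint32_t buf[AS_CAPACITY];
      int      n;       /* hashes currently stored */
      int      depth;   /* keep a hash iff its lowest 'depth' bits are 0 */
  } as_counter;

  static int as_matches(const as_counter *c, uint32_t h)
  {
      return (h & (uint32_t) (((uint64_t) 1 << c->depth) - 1)) == 0;
  }

  static int as_contains(const as_counter *c, uint32_t h)
  {
      for (int i = 0; i < c->n; i++)
          if (c->buf[i] == h)
              return 1;
      return 0;
  }

  void as_add(as_counter *c, uint32_t hash)
  {
      if (!as_matches(c, hash) || as_contains(c, hash))
          return;

      /* buffer full - raise the depth, evict hashes that no longer match */
      while (c->n == AS_CAPACITY)
      {
          int j = 0;

          c->depth++;
          for (int i = 0; i < c->n; i++)
              if (as_matches(c, c->buf[i]))
                  c->buf[j++] = c->buf[i];
          c->n = j;
          if (!as_matches(c, hash))
              return;   /* the new hash no longer qualifies either */
      }
      c->buf[c->n++] = hash;
  }

  /* each stored hash represents about 2^depth distinct values */
  double as_estimate(const as_counter *c)
  {
      return (double) c->n * (double) ((uint64_t) 1 << c->depth);
  }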

3) Rough Estimator (2010)

- this algorithm is described as an "optimal algorithm" (more
precisely, it's a building block of the optimal algorithm), but the
results are not as good as with (1) and (2)

- one of the problems is that it needs k-wise independent hash
functions, and the MD5 hashing scheme used with the other algorithms
probably does not satisfy this condition :-(

- I've tried to implement such hash functions on my own (using
polynomials over a prime field in modular arithmetic), but the results
were quite bad - if you know how to get such hash functions, let me
know (a textbook construction is sketched after this list)

- this algorithm is much more complex than the others, so if someone
could review it, that would be nice (maybe there is a bug causing
the large estimation error)
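For reference, the textbook construction I was experimenting with looks
roughly like this - a random polynomial of degree k-1 over a prime field
gives a k-wise independent family (a sketch using the Mersenne prime
2^61-1 and GCC's __int128; note that mapping the 61-bit result down to a
smaller range by taking a modulo is only approximately uniform):

  #include <stdint.h>
  #include <stdlib.h>

  #define K     4                                /* degree of independence */
  #define PRIME ((uint64_t) 0x1FFFFFFFFFFFFFFF)  /* 2^61 - 1 (Mersenne) */

  typedef struct {
      uint64_t a[K];    /* random coefficients in [0, PRIME) */
  } kwise_hash;

  void kwise_init(kwise_hash *h)
  {
      for (int i = 0; i < K; i++)
          h->a[i] = ((((uint64_t) random()) << 31)
                     ^ (uint64_t) random()) % PRIME;
  }

  /*
   * h(x) = a[K-1]*x^(K-1) + ... + a[1]*x + a[0]  (mod PRIME), evaluated
   * by Horner's rule; the 128-bit intermediate avoids overflow
   */
  uint64_t kwise_eval(const kwise_hash *h, uint64_t x)
  {
      unsigned __int128 acc = h->a[K - 1];

      x %= PRIME;
      for (int i = K - 2; i >= 0; i--)
          acc = (acc * x + h->a[i]) % PRIME;
      return (uint64_t) acc;
  }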

4) Bloom Counter

- basically a Bloom filter + a counter (incremented whenever a new
element is added to the filter) - see the sketch after this list

- due to the design (no false negatives, but false positives allowed)
it's significantly biased (gives underestimates), but I think that
can easily be fixed with a correction coefficient (depending on the
number of items / false positive error rate)

- needs significantly more memory to achieve performance similar to
the other algorithms (a Bloom counter that can track 1 million
distinct values needs about 1.2 MB of memory, while an S-Bitmap that
tracks 10 million items needs just 65 kB)

- so unless we can use the special feature of a Bloom filter (i.e. the
ability to test whether an item is already present in the data), this
is a waste of memory
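For illustration, a Bloom counter is essentially this (a minimal sketch -
the sizes are made up, and the probes are derived from two base hashes by
double hashing, which is not necessarily what the repo does):

  #include <stdint.h>

  #define BLOOM_BITS   (1 << 20)   /* filter size in bits (an assumption) */
  #define BLOOM_HASHES 7           /* number of probes per element */

  typedef struct {
      uint8_t  bits[BLOOM_BITS / 8];
      uint64_t ndistinct;     /* incremented when an element looks new */
  } bloom_counter;

  /* derive the i-th probe from two base hashes (double hashing) */
  static uint32_t bloom_probe(uint32_t h1, uint32_t h2, int i)
  {
      return (h1 + (uint32_t) i * h2) % BLOOM_BITS;
  }

  /*
   * Add an element, given two independent 32-bit hashes of it. The
   * counter grows only if at least one probed bit was unset, i.e. the
   * element looks new - a false positive skips the increment, which is
   * exactly where the underestimates come from.
   */
  void bloom_add(bloom_counter *c, uint32_t h1, uint32_t h2)
  {
      int seen = 1;

      for (int i = 0; i < BLOOM_HASHES; i++)
      {
          uint32_t b = bloom_probe(h1, h2, i);

          if (!(c->bits[b >> 3] & (1 << (b & 7))))
          {
              seen = 0;
              c->bits[b >> 3] |= (uint8_t) (1 << (b & 7));
          }
      }
      if (!seen)
          c->ndistinct++;
  }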

I know Google uses Bloom filters in BigTable to limit I/O, i.e.
before doing the I/O they ask the Bloom filter whether the data is
there - if the answer is no, they can skip the I/O entirely (false
negatives are not allowed), and if the answer is yes, they do the I/O.

Not sure if we can / want to do something like this in PostgreSQL, as
it heavily depends on the workload (and in most cases we know the data
is there, so we have to do the I/O anyway). But I guess it might be a
good idea for a contrib module (maintaining the Bloom filter yourself
and making the decision in the application). The tricky part is
keeping the Bloom filter up to date.

I'll try to fix the optimal estimator (not sure how to do that right
now), and I'll investigate the 'relation fork' proposed as a way to
store the estimator.

regards
Tomas
