Bloom filters

From: Greg Stark <stark(at)mit(dot)edu>
To: "<pgsql-hackers(at)postgresql(dot)org>" <pgsql-hackers(at)postgresql(dot)org>
Subject: Bloom filters
Date: 2010-01-18 12:11:05
Message-ID: 407d949e1001180411v289644d6ncc04c6d6e4788563@mail.gmail.com
Lists: pgsql-hackers

So the other thread on bloom filter indexes got a discussion going in
real space about other neat things you can do with them, and there
were two that seemed useful for Postgres.

We could efficiently get a fairly accurate estimate of the number of
distinct values when doing analyze if we scan the entire table. I'm
not sure what happens now if you do vacuum analyze -- the visibility
map means we don't need to scan the whole table to vacuum it, so I'm
not sure there's a way to force analyze to see every tuple. But if we
did, our current method is bunk, and no method based on a sample of
tuples can ever be very precise.

So assuming we provide a way to force analyze to scan every tuple, we
could build a bloom filter for each column and count the number of
times we find a value the bloom filter says has definitely not been
seen. That would give a lower bound for ndistinct. Every time the
bloom filter gives a positive we can calculate the false positive
rate, given the number of distinct values stored so far, and add in a
fractional distinct value to arrive at an accurate expected value (or
someone could try to do the calculus necessary to avoid doing the
floating point arithmetic if that's a bottleneck).
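
To make that concrete, here's a rough Python sketch of that per-column
loop. The names, the SHA-256-derived hash positions, k = 4, and the
filter size are all just illustrative choices of mine, not a proposal
for the real C implementation; the fractional correction uses the
standard (1 - e^{-kn/m})^k approximation for the false positive rate:

import hashlib
import math

class BloomFilter:
    # Minimal bit-array bloom filter; the k positions are sliced out of
    # one SHA-256 digest, which is good enough for a sketch.
    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, value):
        digest = hashlib.sha256(repr(value).encode()).digest()
        for i in range(self.k):
            h = int.from_bytes(digest[4 * i:4 * i + 4], 'little')
            yield h % self.m

    def might_contain(self, value):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(value))

    def add(self, value):
        for p in self._positions(value):
            self.bits[p >> 3] |= 1 << (p & 7)

def estimate_ndistinct(values, m_bits=64 * 1024 * 1024, k_hashes=4):
    # Scan every value in the column once.
    bf = BloomFilter(m_bits, k_hashes)
    ndistinct = 0.0
    for v in values:
        if not bf.might_contain(v):
            # Definite negative: the value is certainly new.
            ndistinct += 1.0
        else:
            # Positive: either a duplicate or a false positive.  Per the
            # scheme above, add the current false positive rate as a
            # fractional distinct value (a heuristic correction).
            p = (1.0 - math.exp(-k_hashes * ndistinct / m_bits)) ** k_hashes
            ndistinct += p
        bf.add(v)
    return ndistinct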

This actually gives the expected value for the number of distinct hash
values rather than distinct values. That would be a very good estimate
for sets << 2^32 but it gets increasingly imprecise for sets that
approach that size or exceed it, since more and more distinct values
hash to the same 32-bit value.
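
If we wanted to correct for that, the expected number of distinct
32-bit hashes for n distinct values is m*(1 - (1 - 1/m)^n) with
m = 2^32, and that's easy to invert to recover n from the
distinct-hash estimate. A tiny sketch (the function name is mine, just
for illustration):

import math

def correct_for_hash_collisions(distinct_hashes, hash_space=2 ** 32):
    # Invert E[distinct hashes] = m * (1 - (1 - 1/m)^n) to recover n,
    # the number of distinct values, from the distinct-hash estimate.
    m = float(hash_space)
    return math.log1p(-distinct_hashes / m) / math.log1p(-1.0 / m)

# At 100 million distinct values the raw distinct-hash count already
# undercounts by roughly a million (about 1%); the inversion recovers it.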

Also this has the problem that the false positive rate depends on the
size of the bloom filter and the set size. I'm assuming we just divide
maintenance_work_mem by natts and allocate a bloom filter of that size
for each column. That means estimates for high-cardinality columns
will be less accurate, and unless maintenance_work_mem is very large
they could be badly off. I suppose we could print a warning suggesting
raising maintenance_work_mem to get better estimates for such columns.
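
Just to put rough numbers on that, using the usual (1 - e^{-kn/m})^k
approximation with k = 4 hash functions (both assumptions of mine), an
8MB-per-column filter (say 64MB of maintenance_work_mem split across 8
columns) has a negligible false positive rate at a million distinct
values but is nearly useless at fifty million:

import math

def false_positive_rate(n_distinct, filter_bytes, k_hashes=4):
    # Standard approximation p = (1 - e^{-kn/m})^k for a bloom filter of
    # m = 8 * filter_bytes bits holding n_distinct distinct values.
    m = 8.0 * filter_bytes
    return (1.0 - math.exp(-k_hashes * n_distinct / m)) ** k_hashes

per_column = 64 * 1024 * 1024 // 8   # 64MB maintenance_work_mem, 8 columns
print(false_positive_rate(1_000_000, per_column))    # ~0.001%
print(false_positive_rate(50_000_000, per_column))   # ~81%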

But I think it would be much better than our current algorithm. And I
can't see this being patented given that bloom filters were invented
in 1970 and counting distinct values is a pretty obvious application
of them.

--
greg
