From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal : cross-column stats
Date: 2010-12-23 19:39:40
Message-ID: 4D13A57C.8040400@fuzzy.cz
Lists: pgsql-hackers

On 21.12.2010 16:54, Florian Pflug wrote:
> I think that maybe it'd be acceptable to scan a large portion of the
> table to estimate dist(A,B), *if* we must do so only once in a while. But even with
> a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all values in memory,
> and spilling them into, say, an on-disk hash table adds even more overhead to the already
> expensive full scan. Maybe using a bloom filter instead of a hash table could avoid
> the spilling to disk, in exchange for a slightly less precise result...

I've read up on the basics of Bloom filters, and I'm not sure we can
use them in this case. I see two problems with this approach:

1) the impossibility of predicting the number of elements in advance

To build a Bloom filter with a bounded error rate, you need to size it
properly (number of hash functions and size of the bit array), which in
turn requires a good estimate of the number of distinct elements. But
that's exactly the information we're looking for.

I guess we could use the highest possible value (equal to the number
of tuples) - according to the Wikipedia article you need about 10 bits
per element for a 1% error rate, i.e. roughly 1.2 MB of memory per
million elements.
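
Just to make that arithmetic concrete, here is a quick standalone
sketch (plain C, not meant as backend code) of the usual sizing
formulas m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2:

#include <math.h>
#include <stdio.h>

/* Classic Bloom filter sizing: m bits and k hash functions for n
 * expected elements and a target false positive rate p. */
static void
bloom_size(double n, double p, double *m_bits, int *k_hashes)
{
    double ln2 = log(2.0);

    *m_bits = ceil(-n * log(p) / (ln2 * ln2));
    *k_hashes = (int) round((*m_bits / n) * ln2);
}

int
main(void)
{
    double m;
    int k;

    bloom_size(1000000.0, 0.01, &m, &k);
    printf("1M elements at 1%% error: %.0f bits (%.1f MB), %d hashes\n",
           m, m / 8.0 / (1024.0 * 1024.0), k);
    return 0;
}

For 1M elements and 1% error this prints about 9.6M bits (~1.1 MB) and
7 hash functions.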

That's not much, but it needs to be done for each column separately
and for the combination of columns - for N columns you need (at
least) N+1 filters.

Sure - increasing the error rate and using a lower estimate of the
element count to size the Bloom filter would significantly decrease
the memory requirements.
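
BTW, one detail I glossed over is how the filter would give us a
distinct-count estimate in the first place. One well-known option (my
assumption here, not something from Florian's mail) is to count the
set bits and invert the expected fill ratio, i.e.
n ~ -(m/k) * ln(1 - X/m) for X set bits out of m:

#include <math.h>
#include <stdio.h>

/* Estimate how many distinct elements were inserted into a Bloom
 * filter with m bits and k hash functions, given that x bits ended
 * up set. */
static double
bloom_ndistinct(double m, int k, double x)
{
    return -(m / k) * log(1.0 - x / m);
}

int
main(void)
{
    /* hypothetical filter: 9.6M bits, 7 hashes, ~4.97M bits set */
    printf("estimated ndistinct: %.0f\n",
           bloom_ndistinct(9.6e6, 7, 4.97e6));
    return 0;
}

With the numbers from above (9.6M bits, 7 hashes), a filter that ends
up roughly half full corresponds to about a million distinct values.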

2) sampling the whole table

A naive approach would be to scan the whole table each time, but for
large tables that might be a bit of a problem. So I'm thinking about
what the alternatives are ...

One possibility is to build the Bloom filter incrementally and store
it on disk; I guess this could be done during VACUUM, for example.
Anyway, there's still the issue of how to set the filter size
initially (i.e. how to estimate the number of elements), just as in
point (1) above.

Another possibility is to collect data from just a small portion of
the table and then extrapolate the number of distinct values for the
whole table. But I'm not sure we can do that reliably - I see many
traps in this.

But maybe I'm just missing something important.

regards
Tomas
