From: Florian Pflug <fgp(at)phlo(dot)org>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal : cross-column stats
Date: 2010-12-24 03:41:27
Message-ID: B0AD5578-A5CA-49DE-85AC-8A160A8E22D9@phlo.org
Lists: pgsql-hackers

On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
> Dne 21.12.2010 16:54, Florian Pflug napsal(a):
>> I think that maybe it'd be acceptable to scan a large portion of the
>> table to estimate dist(A,B), *if* we only do so once in a while. But even with
>> a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all values in memory,
>> and spilling them into, say, an on-disk hash table adds even more overhead to the already
>> expensive full scan. Maybe using a bloom filter instead of a hash table could avoid
>> the spilling to disk, in exchange for a slightly less precise result...
>
> I've read some basics about Bloom filters, and I'm not sure we can use
> it in this case. I see two problems with this approach:
>
> 1) impossibility to predict the number of elements in advance
>
> To build a Bloom filter with limited error rate, you need to size it
> properly (number of hash function and size of the bit array). But
> that's exactly the information we're looking for.
>
> I guess we could use the highest possible value (equal to the number
> of tuples) - according to wiki you need about 10 bits per element
> for a 1% error rate, i.e. about 1.2 MB of memory per million
> elements.
Drat. I had expected these numbers to come out quite a bit lower than
that, at least for a higher error target. But even with a 10% false
positive rate it's still about 5 bits per element, i.e. roughly 0.6 MB
per 1e6 elements. Still too much to assume the filter will always fit
into memory, I fear :-(
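The standard sizing formulas make these figures easy to check. A quick sketch (illustrative only, not PostgreSQL code) of the optimal bit count m and hash count k for n elements at a target false-positive rate p:

```python
import math

def bloom_params(n, p):
    """Optimal Bloom filter size in bits (m) and hash count (k)
    for n distinct elements and target false-positive rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = max(1, round(m / n * math.log(2)))
    return m, k

n = 1_000_000
for p in (0.01, 0.10):
    m, k = bloom_params(n, p)
    print(f"p={p:.2f}: {m/n:.1f} bits/element, {m/8/2**20:.2f} MiB, k={k}")
# p=0.01: 9.6 bits/element, 1.14 MiB, k=7
# p=0.10: 4.8 bits/element, 0.57 MiB, k=3
```

So a filter for 1e8 distinct values at 10% error is still on the order of 60 MB - far beyond what one would want to pin in memory per column pair.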

> 2) sampling the whole table
>
> A naive approach would be to sample the whole table each time, but
> for large tables that might be a bit of a problem. So I'm thinking
> about what alternatives are there ..

> One possibility is to build the Bloom filter incrementally and store
> it on disk - this could happen during VACUUM, for example. But then
> there's still the issue of how to size the filter initially (i.e.
> estimate the number of elements), just as in the previous section.
The filter size could be derived from the table's statistics target, or
be otherwise user-definable. We could also auto-resize once it gets too
full. But still, that all seems awfully complex :-(
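One complication worth spelling out: a Bloom filter cannot be grown in place, since the original values can't be recovered from the bits, so "auto-resize" really means rebuilding from a fresh scan. A minimal in-memory sketch (my own naming, not PostgreSQL code) that at least detects when that rebuild is due:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; illustrative sketch only."""

    def __init__(self, m_bits, k):
        self.m, self.k = m_bits, k
        self.bits = bytearray((m_bits + 7) // 8)
        self.set_bits = 0  # population count, tracked incrementally

    def _positions(self, value):
        # Derive k positions from one SHA-256 digest (k <= 8 here).
        h = hashlib.sha256(str(value).encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(h[4 * i:4 * i + 4], "big") % self.m

    def add(self, value):
        for pos in self._positions(value):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                self.bits[byte] |= 1 << bit
                self.set_bits += 1

    def contains(self, value):
        return all((self.bits[p // 8] >> (p % 8)) & 1
                   for p in self._positions(value))

    def too_full(self, threshold=0.5):
        # Past ~50% fill the false-positive rate degrades quickly;
        # crossing this line means a rebuild (i.e. another scan).
        return self.set_bits / self.m > threshold
```

Tracking the popcount incrementally also buys something else: it is exactly the quantity a distinct-count estimate over the filter would be derived from.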

> Another possibility is to collect the data from just a small portion
> of a table and then use the result to estimate the number of distinct
> values for the whole table. But I'm not sure we can do this reliably,
> I see many traps in this.
This is how it works currently. The problem with this approach is that
it gives you very few guarantees about how precise the result will be.
Extrapolating works very well for things like MCVs and histograms, because
there you're by definition interested mostly in values which occur often -
and thus appear with high probability in the relatively few rows you
sample. For the number of distinct values, however, this isn't true - even
if ndistinct is an order of magnitude smaller than the number of rows, a
large share of the distinct values may each occur in only a few rows, and
those are exactly the values a small sample is likely to miss...
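A toy simulation (not PostgreSQL's actual estimator) makes the problem concrete: with ndistinct ten times smaller than the row count, a 1% row sample still sees only about a tenth of the distinct values, so any extrapolation has to guess the shape of the frequency distribution:

```python
import random

random.seed(1)
n_rows = 100_000
# ndistinct is an order of magnitude below n_rows, yet every value
# occurs in only ~10 rows - i.e. every value is "rare".
ndistinct_true = 10_000
table = [random.randrange(ndistinct_true) for _ in range(n_rows)]

sample = random.sample(table, 1_000)   # a 1% row sample
d_sample = len(set(sample))
print(ndistinct_true, d_sample)        # d_sample comes out near 950
```

Depending on whether the estimator assumes the unseen rows repeat the seen values or introduce new ones, the extrapolated ndistinct can land anywhere between ~1,000 and ~95,000 for a true value of 10,000.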

Another idea would be to obtain the ndistinct values from an index somehow.
Postgres cannot currently scan an index in physical order, only in logical
order, due to locking considerations. But since we'd only be interested in
an estimate, maybe a scan in physical block order would work for ndistinct
estimates? Just a wild idea, mind you, I haven't checked at all if that'd
be even remotely feasible.
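One point in favour of such a scheme: a bitmap-based distinct-count estimate is completely order-independent, so it wouldn't care whether the values arrive in logical or physical block order. A hedged sketch of the simplest such estimator, linear counting (Whang et al.), which is roughly what any Bloom-filter-derived ndistinct estimate boils down to:

```python
import hashlib
import math

def linear_count(values, m_bits=1 << 16):
    """Estimate the number of distinct values by hashing each value
    to one bit and inverting the expected fraction of zero bits."""
    bits = bytearray(m_bits // 8)
    for v in values:
        h = int.from_bytes(
            hashlib.blake2b(str(v).encode(), digest_size=4).digest(), "big")
        pos = h % m_bits
        bits[pos // 8] |= 1 << (pos % 8)
    zero_fraction = (m_bits - sum(bin(b).count("1") for b in bits)) / m_bits
    return -m_bits * math.log(zero_fraction)

print(round(linear_count(range(10_000))))   # close to 10000
```

The bitmap still has to be sized relative to ndistinct to keep the error bounded, so this doesn't escape the sizing problem above - but duplicates and arbitrary visit order are handled for free.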

best regards,
Florian Pflug
