From: | Tomas Vondra <tv(at)fuzzy(dot)cz> |
---|---|

To: | pgsql-hackers(at)postgresql(dot)org |

Subject: | estimating # of distinct values |

Date: | 2010-12-27 16:13:03 |

Message-ID: | 4D18BB0F.5010303@fuzzy.cz |

Views: | Raw Message | Whole Thread | Download mbox | Resend email |

Thread: | |

Lists: | pgsql-hackers |

Hi everyone,

about two weeks ago I've started a thread about cross-column stats. One

of the proposed solutions is based on number of distinct values, and the

truth is the current distinct estimator has some problems.

I've done some small research - I have read about 20 papers on this, and

I'd like to present a short summary here, possible solutions etc.

The simple truth is

1) sampling-based estimators are a dead-end

2) there are very interesting stream-based estimators

3) the stream-based estimators have some disadvantages

I wrote a more thorough summary on a wiki

(http://wiki.postgresql.org/wiki/Estimating_Distinct) with a list of the

most interesting papers etc. so just a very short explanation.

1) sampling-based estimators are a dead-end

The paper from Charikar & Chaudhuri states (and proves) that for

each sampling-based estimator, there's a data set where the estimator

produces arbitrary error (with an indispensable probability). So

replacing one sampling-based estimator with another generally does

not improve the situation - fixes one dataset, breaks another one.

The error is directly related to the size of the sample, so the

truth is that to get a good distinct estimate you need to scan a

significant portion of the table (almost all of it).

So the estimators based on tiny samples are a dead end.

2) there are very interesting stream-based estimators

A very interesting idea comes from the field of data streams. These

estimates are based no a one pass through the data, and then

incremental updates. A very nice thing is that these algorithms

don't need that much space, as they don't store the actual values.

The idea behind this is similar to the Bloom filter, i.e. set bits

of a bitmap using a hash of the value. But in this case it's not

required to be able to answer questions like 'is this an element

of the set X?' so it's actually even more space efficient. For an

introduction see the first paper published in 1985 by Flajolet

(http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.7100).

One of the recent articles (published in June 2010) actually presents

an optimal algorithm with O(e^-2 + log(n)) space complexity and O(1)

update complexity. The "e" means precision, i.e. that the estimate

will be within [(1-e)F, (1+e)F] where F is the real value.

The paper on self-learning bitmap states that to track 10^9 distinct

values with 1% relative error you need about 61 kilobits of space

(which is absolutely awesome). The optimal algorithm needs a bit more

space I think,

A very interesting solution id "distinct sampling" that somehow

extends the Wegman's adaptive sampling approach.

3) the stream-based estimators have some disadvantages

Not everything is perfect, though - the most serious disadvantage is

that those algorithms (usually) don't handle removal of elements.

Inserts are easy to handle, but deletes are hard (just as in case of

Bloom filters).

So using this stream-based approach might lead to degradation in

case of heavily updated tables :-(

I see two possible solutions here:

(a) use counters instead of bits, and track actual counts - but this

is tricky, especially with 'probabilistic' algorithms and needs

much more space (but still, 32x 61kb is just 250kB)

(b) count how many deletes/updates were performed, and if needed

rebuild the whole bitmap

But even though these disadvantages, there really is no other

way to enhance the estimates. I don't think this should be a

default behavior - just as in case of cross-column stats this should

be optional when the current estimator does not work well.

regards

Tomas

- Re: estimating # of distinct values at 2010-12-27 21:46:32 from Robert Haas
- Re: estimating # of distinct values at 2010-12-28 06:39:47 from Josh Berkus
- Re: estimating # of distinct values at 2011-01-15 19:26:19 from Tomas Vondra

From | Date | Subject | |
---|---|---|---|

Next Message | Tom Lane | 2010-12-27 16:54:56 | Re: C++ keywords in headers (was Re: [GENERAL] #include <funcapi.h>) |

Previous Message | Alvaro Herrera | 2010-12-27 16:12:54 | Re: C++ keywords in headers (was Re: [GENERAL] #include <funcapi.h>) |