Re: estimating # of distinct values

From: tv(at)fuzzy(dot)cz
To: "Josh Berkus" <josh(at)agliodbs(dot)com>
Cc: "Tomas Vondra" <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: estimating # of distinct values
Date: 2010-12-28 13:41:24
Message-ID: f739df2d3193e7b66e5c302c23439817.squirrel@sq.gransy.com

>
>> The simple truth is
>>
>> 1) sampling-based estimators are a dead-end
>
> The Charikar and Chaudhuri paper does not, in fact, say that it is
> impossible to improve sampling-based estimators as you claim it does. In
> fact, the authors offer several ways to improve sampling-based
> estimators. Further, 2000 was hardly the end of sampling-estimation
> paper publication; there are later papers with newer ideas.

Well, the paper proves there is a lower bound on the possible error of a
sampling-based estimator, depending on the sample size. The actual
inequality is

   error(d) >= sqrt( (n - r) / (2*r) * 1/q )

where n is the number of rows in the table, r is the sample size, and
error(d) is the "ratio error" (d = estimated number of distinct values,
D = actual number of distinct values):

   error(d) = max{ D/d, d/D }

And all this holds with probability at least q (you can choose q, as long
as q >= e^{-1000}).

Say you have a table with 1,000,000 rows and you use a sample of 1,000
rows to do the estimate. In that case you get

   error(d) >= 99 with q = 0.05
   error(d) >= 70 with q = 0.1
   error(d) >= 31 with q = 0.5

If you can sample 10% of the table (100,000 rows), you get this:

   error(d) >= 9.5 with q = 0.05
   error(d) >= 6.7 with q = 0.1
   error(d) >= 3 with q = 0.5
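
(If you want to recheck these numbers, here's a trivial Python sketch
that just plugs n, r and q into the inequality above:)

    import math

    # lower bound on the ratio error of any sample-based estimator,
    # per the inequality above: sqrt( (n - r) / (2*r) * 1/q )
    def ratio_error_lower_bound(n, r, q):
        return math.sqrt((n - r) / (2.0 * r) / q)

    n = 1000000
    for r in (1000, 100000):          # 0.1% and 10% samples
        for q in (0.05, 0.1, 0.5):
            print(r, q, round(ratio_error_lower_bound(n, r, q), 1))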

So even with 10% of the table, there's a 10% chance of getting an
estimate that's about 7x over- or underestimated. For lower probabilities
the interval is much wider.

> For example, I still think we could tremendously improve our current
> sampling-based estimator without increasing I/O by moving to block-based
> estimation*. The accuracy statistics for block-based samples of 5% of
> the table look quite good.

Well, that's certainly possible. But at best you can achieve the error(d)
lower bound consistently (for all datasets); you can't do better. And
they've already presented an estimator that does exactly this (called AE,
Adaptive Estimator, in the paper).
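
(For the record, the simpler GEE - Guaranteed-Error Estimator - from the
same paper, which already matches the lower bound up to a constant
factor, is easy to sketch. This is just my reading of the paper, in
Python, so treat it as an illustration, not a reference implementation:)

    import math
    from collections import Counter

    def gee_estimate(sample, n):
        # f[i] = number of distinct values occurring exactly i times
        # in a sample of r rows drawn from an n-row table
        r = len(sample)
        f = Counter(Counter(sample).values())
        # scale up the values seen exactly once by sqrt(n/r), count
        # the values seen more than once as-is
        return (math.sqrt(float(n) / r) * f[1]
                + sum(c for i, c in f.items() if i >= 2))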

> I would agree that it's impossible to get a decent estimate of
> n-distinct from a 1% sample. But there's a huge difference between 5%
> or 10% and "a majority of the table".

Sure, we can do better. But there's a limit we can't cross no matter what
estimator we choose and how large the sample is.

I've seen several post-2000 papers on sample-based estimators, but those
are mostly hybrid estimators, i.e. estimators composed of several simple
estimators. So in the end it's pretty complicated, you need to gather a
lot of different stats, and you still can't get an error better than the
lower bound :-(

So yes, we can improve the current estimator (at the cost of making it
more complex), but that does not actually solve the problem.

> Again, don't let this discourage you from attempting to write a
> stream-based estimator. But do realize that you'll need to *prove* its
> superiority, head-to-head, against sampling-based estimators.
>
> [* http://www.jstor.org/pss/1391058 (unfortunately, no longer
> public-access)]

It's available here: http://www.stat.washington.edu/research/reports/1990s/

But it does not present an estimator contradicting the Charikar and
Chaudhuri paper - it's still 'just' a sample-based estimator, although
they draw the sample at the block level. But yes, that's a good idea, and
I think I've already mentioned it in the cross-column stats thread.

The question is whether a sample obtained this way will be as good as the
current samples. You could get quite separate 'clusters' of values, one
cluster for each block, even though in reality the values are uniformly
distributed through the table. And I guess the resulting histogram would
be crippled by this too.
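
(To illustrate the concern, here's a toy simulation, entirely made up,
comparing a row-level and a block-level sample of the same size on a
column whose values happen to follow the physical row order:)

    import random

    N, BLOCK, SAMPLE = 1000000, 100, 1000

    # worst case for block sampling: row i holds value i, so values
    # are perfectly correlated with physical position

    # row-level sample: SAMPLE rows picked uniformly at random
    row_sample = sorted(random.sample(range(N), SAMPLE))

    # block-level sample: SAMPLE/BLOCK whole blocks of BLOCK rows
    blocks = random.sample(range(N // BLOCK), SAMPLE // BLOCK)
    block_sample = sorted(b * BLOCK + i
                          for b in blocks for i in range(BLOCK))

    # the block sample is a few tight runs of consecutive values with
    # huge gaps in between; the row sample is spread evenly
    def max_gap(s):
        return max(b - a for a, b in zip(s, s[1:]))

    print("row-level sample, max gap:  ", max_gap(row_sample))
    print("block-level sample, max gap:", max_gap(block_sample))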

But if you know about interesting papers on sample-based estimators
(especially post-2000), let me know. I've searched for them, but those I
found were not very interesting IMHO.

Tomas
