Greg,
> Using a variety of synthetic and real-world data sets, we show that
> distinct sampling gives estimates for distinct values queries that
> are within 0%-10%, whereas previous methods were typically 50%-250% off,
> across the spectrum of data sets and queries studied.
Aha. It's a question of the level of error permissible. For our
estimates, being 100% off is actually OK. That's why I was looking at 5%
block sampling; it stays within the range of +/- 50% n-distinct in 95% of
cases.
> Doing a bit of basic searching around I think the tool we're looking for
> here is called a "chi-squared test for independence".
Augh. I wrote a program (in Pascal) to do this back in 1988. Now I can't
remember the math. For a two-column test it's relatively
computation-light, though, as I recall ... but I don't remember whether
standard chi square works with a random sample.
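
For reference, a rough sketch of the two-column statistic in Python. This
is the textbook Pearson chi-squared for a contingency table, not anything
from the PostgreSQL source; the function name and the input format (a list
of (a, b) value pairs taken from the two columns) are assumptions made for
illustration only.

```python
from collections import Counter


def chi_squared_independence(pairs):
    """Pearson chi-squared statistic for independence of two columns.

    `pairs` is an iterable of (a, b) tuples, one per row (hypothetical
    input format). Returns (statistic, degrees_of_freedom).
    """
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one row")

    # One pass over the rows gives the joint and marginal counts.
    joint = Counter(pairs)
    row_totals = Counter(a for a, _ in pairs)
    col_totals = Counter(b for _, b in pairs)

    stat = 0.0
    for a, row_count in row_totals.items():
        for b, col_count in col_totals.items():
            # Count expected in this cell if the columns were independent.
            expected = row_count * col_count / n
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof
```

Counting is one pass over the rows; the rest is a loop over the cells of
the table, so the cost is driven by (distinct values in column one) x
(distinct values in column two). The sketch only computes the statistic.
It does not settle whether the test stays valid when fed a random sample
rather than the whole table.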
--
--Josh
Josh Berkus
PostgreSQL @ Sun
San Francisco