| From: | Greg Stark <gsstark(at)mit(dot)edu> | 
|---|---|
| To: | josh(at)agliodbs(dot)com | 
| Cc: | Greg Stark <gsstark(at)mit(dot)edu>, David Fetter <david(at)fetter(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org | 
| Subject: | Re: More thoughts about planner's cost estimates | 
| Date: | 2006-06-02 20:23:14 | 
| Message-ID: | 87pshre28d.fsf@stark.xeocode.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
Josh Berkus <josh(at)agliodbs(dot)com> writes:
> >     Using a variety of synthetic and real-world data sets, we show that
> >     distinct sampling gives estimates for distinct values queries that
> > are within 0%-10%, whereas previous methods were typically 50%-250% off,
> > across the spectrum of data sets and queries studied.
> 
> Aha.  It's a question of the level of error permissable.   For our 
> estimates, being 100% off is actually OK.  That's why I was looking at 5% 
> block sampling; it stays within the range of +/- 50% n-distinct in 95% of 
> cases.
Well, let's see. Say for example we're scanning 50,000 records out of 1M
records. Using the Theorem from Charikar et al quoted as Theorem 1 in section
2 we get that there is at least a 5% chance of getting a result with a ratio
error at least sqrt((1M-50k)/100k ln(20)) or 5.33.
So no, even assuming you have a good unbiased sample, a 5% sample is only
going to get you to a result within 19%-533% of the real values 95% of the
time. Not 50-250%. And that's a negative result, so it's *at best* 19-533%. On
some distributions it may be outside that range even more than 95% of the
time.
This is entirely unlike the histograms where we have a solid foundation for a
positive result. We can guarantee that the result will be outside +/- x% *at
most* 5% of the time. For most distributions it'll be outside that range even
less.
And a 5% sample is a pretty big. In fact my tests earlier showed the i/o from
5% block sampling took just as long as reading all the blocks. Even if we
figure out what's causing that (IMHO surprising) result and improve matters I
would only expect it to be 3-4x faster than a full scan.
http://archives.postgresql.org/pgsql-hackers/2006-01/msg00285.php
-- 
greg
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Tom Lane | 2006-06-02 20:38:59 | Re: COPY (query) TO file | 
| Previous Message | Tino Wildenhain | 2006-06-02 19:43:04 | Re: COPY (query) TO file |