## Re: More thoughts about planner's cost estimates

Greg Stark wrote:

> Josh Berkus <josh(at)agliodbs(dot)com> writes:
>
>>> Using a variety of synthetic and real-world data sets, we show that
>>> distinct sampling gives estimates for distinct values queries that
>>> are within 0%-10%, whereas previous methods were typically 50%-250% off,
>>> across the spectrum of data sets and queries studied.
>>
>> Aha. It's a question of the level of error permissible. For our
>> estimates, being 100% off is actually OK. That's why I was looking at 5%
>> block sampling; it stays within the range of +/- 50% n-distinct in 95% of
>> cases.
>
> Well, let's see. Say for example we're scanning 50,000 records out of 1M
> records. Using the theorem from Charikar et al., quoted as Theorem 1 in
> section 2, we get that there is at least a 5% chance of getting a result
> with a ratio error of at least sqrt((1M-50k)/100k * ln(20)), or 5.33.
>
> So no, even assuming you have a good unbiased sample, a 5% sample is only
> going to get you a result within 19%-533% of the real values 95% of the
> time. Not 50%-250%. And that's a negative result, so it's *at best*
> 19%-533%. On some distributions it may be outside that range even more than
> 95% of the time.
>
> This is entirely unlike the histograms, where we have a solid foundation
> for a positive result. We can guarantee that the result will be outside
> +/- x% *at most* 5% of the time. For most distributions it'll be outside
> that range even less.
>
> And a 5% sample is pretty big. In fact, my tests earlier showed the I/O
> from 5% block sampling took just as long as reading all the blocks. Even if
> we figure out what's causing that (IMHO surprising) result and improve
> matters, I would only expect it to be 3-4x faster than a full scan.
>
> http://archives.postgresql.org/pgsql-hackers/2006-01/msg00285.php

I'm sorry to interrupt your esoteric (to me) discussion, but I have a very simple question: would you define a "good unbiased sample"?

My statistics professor Dan Price (rest his soul) would tell me there are only random samples of some sort, and "other", which are always biased and never good. Excuse my absolutes; I didn't mean Good or Evil. And how do you calculate a level of error without randomization? And are you talking of type I or type II error?

Michael
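For readers following along, Greg's number can be reproduced directly. The quoted bound from Charikar et al. (Theorem 1) says that for a sample of r rows out of n, any estimator of the number of distinct values has, with probability at least delta, a ratio error of at least sqrt((n - r) / (2r) * ln(1/delta)); with n = 1M, r = 50k, and delta = 0.05, the 2r in the denominator gives the 100k in Greg's expression. A minimal sketch (the function name is mine, not from the paper):

```python
import math

def ratio_error_lower_bound(n, r, delta):
    """Charikar et al. Theorem 1 (as quoted above): with probability at
    least delta, any distinct-values estimator based on a sample of r
    rows out of n has ratio error >= sqrt((n - r) / (2r) * ln(1/delta))."""
    return math.sqrt((n - r) / (2 * r) * math.log(1 / delta))

# Greg's example: 50,000 records sampled out of 1,000,000, delta = 5%.
err = ratio_error_lower_bound(1_000_000, 50_000, 0.05)
print(round(err, 2))       # -> 5.33
# The "19%-533%" range is this factor applied in both directions:
print(round(1 / err, 2))   # underestimate by up to ~0.19x (19%)
```

Note this is a worst-case lower bound over all data distributions, which is why Greg calls it a negative result: no sampling-based estimator can beat it on every distribution.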

### In response to

- Re: More thoughts about planner's cost estimates at 2006-06-02 20:23:14 from Greg Stark

### Responses

- Re: More thoughts about planner's cost estimates at 2006-06-02 20:56:11 from David Fetter

### pgsql-hackers by date

- Next: Re: More thoughts about planner's cost estimates from Tom Lane at 2006-06-02 20:43:36
- Previous: Re: COPY (query) TO file from Tom Lane at 2006-06-02 20:38:59