Re: query slows down with more accurate stats

From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-performance(at)postgresql(dot)org
Subject: Re: query slows down with more accurate stats
Date: 2004-04-16 22:26:22
Message-ID: gdi08094ik0v3jdlklbc7b0rju02dnovj2@email.aon.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

On Fri, 16 Apr 2004 10:34:49 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> p = prod from{i = 0} to{n - 1} {{c(B - i)} over {cB - i}}
>
>So? You haven't proven that either sampling method fails to do the
>same.

On the contrary, I believe that above formula is more or less valid for
both methods. The point is in what I said next:
| This probability grows with increasing B.

For the one-stage sampling method B is the number of pages of the whole
table. With two-stage sampling we have to use n instead of B and get a
smaller probability (for n < B, of course). So this merely shows that
the two sampling methods are not equivalent.

>The desired property can also be phrased as "every tuple should be
>equally likely to be included in the final sample".

Only at first sight. You really expect more from random sampling.
Otherwise I'd just put one random tuple and its n - 1 successors (modulo
N) into the sample. This satisfies your condition but you wouldn't call
it a random sample.

Random sampling is more like "every possible sample is equally likely to
be collected", and two-stage sampling doesn't satisfy this condition.

But if in your opinion the difference is not significant, I'll stop
complaining against my own idea. Is there anybody else who cares?

>You could argue that a tuple on a heavily populated page is
>statistically likely to see a higher T when it's part of the page sample
>pool than a tuple on a near-empty page is likely to see, and therefore
>there is some bias against selection of the former tuple. But given a
>sample over a reasonably large number of pages, the contribution of any
>one page to T should be fairly small and so this effect ought to be
>small.

It is even better: Storing a certain number of tuples on heavily
populated pages takes less pages than to store them on sparsely
populated pages (due to tuple size or to dead tuples). So heavily
populated pages are less likely to be selected in stage one, and this
exactly offsets the effect of increasing T.

>So I think this method is effectively unbiased at the tuple level.

Servus
Manfred

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Blasby 2004-04-16 22:51:07 Re: GiST -- making my index faster makes is slower
Previous Message Tom Lane 2004-04-16 22:16:43 Re: GiST -- making my index faster makes is slower

Browse pgsql-performance by date

  From Date Subject
Next Message Greg Stark 2004-04-16 22:57:51 Re: Poor performance of group by query
Previous Message Ron St-Pierre 2004-04-16 21:54:55 Re: Index Problem?