Re: query slows down with more accurate stats

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-performance(at)postgresql(dot)org
Subject: Re: query slows down with more accurate stats
Date: 2004-04-16 14:34:49
Message-ID: 29060.1082126089@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

Manfred Koizar <mkoi-pg(at)aon(dot)at> writes:
> If the number of pages is B and the sample size is n, a perfect sampling
> method collects a sample where all tuples come from different pages with
> probability (in OpenOffice.org syntax):
> p = prod from{i = 0} to{n - 1} {{c(B - i)} over {cB - i}}

So? You haven't proven that either sampling method fails to do the
same.

The desired property can also be phrased as "every tuple should be
equally likely to be included in the final sample". What we actually
have in the case of your revised algorithm is "every page is equally
likely to be sampled, and of the pages included in the sample, every
tuple is equally likely to be chosen". Given that there are B total
pages of which we sample b pages that happen to contain T tuples (in any
distribution), the probability that a particular tuple gets chosen is
(b/B) * (n/T)
assuming that the two selection steps are independent and unbiased.

Now b, B, and n are not dependent on which tuple we are talking about.
You could argue that a tuple on a heavily populated page is
statistically likely to see a higher T when it's part of the page sample
pool than a tuple on a near-empty page is likely to see, and therefore
there is some bias against selection of the former tuple. But given a
sample over a reasonably large number of pages, the contribution of any
one page to T should be fairly small and so this effect ought to be
small. In fact, because T directly determines our estimate of the total
number of tuples in the relation, your experiments showing that the new
method gives a reliable tuple count estimate directly prove that T is
pretty stable regardless of exactly which pages get included in the
sample. So I think this method is effectively unbiased at the tuple
level. The variation in probability of selection of individual tuples
can be no worse than the variation in the overall tuple count estimate.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Shalu Gupta 2004-04-16 15:17:13 Accessing RelOptInfo structure from the executor module
Previous Message Andrew Sullivan 2004-04-16 12:11:58 Re: Socket communication for contrib

Browse pgsql-performance by date

  From Date Subject
Next Message Jim C. Nasby 2004-04-16 15:17:06 Poor performance of group by query
Previous Message Tom Lane 2004-04-16 13:49:38 Re: RESOLVED: Re: Wierd context-switching issue on Xeon