Skip site navigation (1) Skip section navigation (2)

Number of pages in a random sample (was: query slows down with more accurate stats)

From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>,pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Number of pages in a random sample (was: query slows down with more accurate stats)
Date: 2004-04-25 20:26:56
Message-ID: tg0o80d1hn166m6vvt6h44qs55ud06ac2k@email.aon.at (view raw or flat)
Thread:
Lists: pgsql-hackerspgsql-performance
On Mon, 19 Apr 2004 12:00:10 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>A possible compromise is to limit the number of pages sampled to
>something a bit larger than n, perhaps 2n or 3n.  I don't have a feeling
>for the shape of the different-pages probability function; would this
>make a significant difference, or would it just waste cycles?

I would have replied earlier, if I had a good answer.  What I have so
far contains at least one, probably two flaws.  Knowing not much more
than the four basic arithmetic operations I was not able to improve my
model.  So I post what I have:

As usual we assume a constant number c of tuples per page.  If we have a
table of size B pages and want to collect a sample of n tuples, the
number of possible samples is (again in OOo syntax)

	left( binom{cB}{n} right)

If we select an arbitrary page, the number of possible samples that do
NOT contain any tuple from this page is

	left( binom {c (B-1)} {n} right)

Let's forget about our actual implementations of sampling methods and
pretend we have a perfect random sampling method.  So the probability
Pnot(c, B, n) that a certain page is not represented in a random sample
is

	left( binom {c (B-1)} {n} right) over left( binom{cB}{n} right)

which can be transformed into the more computing-friendly form

	prod from{i=0} to{n-1} {{cB-c - i} over {cB - i}}

Clearly the probability that a certain page *is* represented in a sample
is

	Pyes(c, B, n) = 1 - Pnot(c, B, n)

The next step assumes that these probabilities are independent for
different pages, which in reality they are not.  We simply estimate the
number of pages represented in a random sample as 

	numPag(c, B, n) = B * Pyes(c, B, n)

Here are some results for n = 3000:

	B  \ c->    10 |   100 |   200
	-------+-------+-------+-------
	   100 |  ---  |   100 |   100
	  1000 |   972 |   953 |   951
	  2000 |  1606 |  1559 |  1556
	  3000 |  1954 |  1902 |  1899
	  6000 |  2408 |  2366 |  2363
	  9000 |  2588 |  2555 |  2553
	 20000 |  2805 |  2788 |  2787
	 30000 |  2869 |  2856 |  2856
	100000 |  2960 |  2956 |  2956

This doesn't look to depend heavily on the number of tuples per page,
which sort of justifies the assumption that c is constant.

In the next step I tried to estimate the number of pages that contain
exactly 1, 2, ... tuples of the sample.  My naive procedure works as
follows (I'm not sure whether it is even valid as a rough approximation,
constructive criticism is very welcome):

For c=100, B=3000, n=3000 we expect 1902 pages to contain at least 1
tuple of the sample.  There are 1098 more tuples than pages, these
tuples lie somewhere in those 1902 pages from the first step.
numPag(99, 1902, 1098) = 836 pages contain at least a second tuple.
So the number of pages containing exactly 1 tuple is 1902 - 836 = 1066.
Repeating these steps we get 611 pages with 2 tuples, 192 with 3, 30
with 4, and 3 pages with 5 tuples.

Here are some more numbers for c = 100 and n = 3000:
	
	   B   | pages with 1, 2, ... tuples
	-------+--------------------------------------------------------
	   100 |  1 to 24 tuples: 0, then 1, 2, 4, 10, 18, 26, 24, 11, 4
	  1000 |  108, 201, 268, 229, 113, 29, 5
	  2000 |  616, 555, 292,  83, 12, 1
	  3000 | 1066, 611, 192,  30,  3
	  6000 | 1809, 484,  68,   5
	  9000 | 2146, 374,  32,   2
	 20000 | 2584, 196,   8
	 30000 | 2716, 138,   3
	100000 | 2912,  44

A small C program to experimentally confirm or refute these calculations
is attached.  Its results are fairly compatible with above numbers,
IMHO.

Servus
 Manfred

Attachment: samsim.c
Description: text/plain (3.4 KB)

In response to

Responses

pgsql-performance by date

Next:From: Manfred KoizarDate: 2004-04-25 20:46:54
Subject: Re: Why will vacuum not end?
Previous:From: Shea,Dan [CIS]Date: 2004-04-25 13:05:11
Subject: Re: Why will vacuum not end?

pgsql-hackers by date

Next:From: Bruce MomjianDate: 2004-04-25 20:41:27
Subject: Re: [HACKERS] What can we learn from MySQL?
Previous:From: Andrew DunstanDate: 2004-04-25 20:21:52
Subject: Re: Bringing PostgreSQL torwards the standard regarding

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group