Re: Tuple sampling

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc:	pgsql-patches(at)postgresql(dot)org
Subject:	Re: Tuple sampling
Date:	2004-05-23 21:32:36
Message-ID:	28169.1085347956@sss.pgh.pa.us
Lists:	pgsql-patches

Manfred Koizar <mkoi-pg(at)aon(dot)at> writes:
> This patch implements the new tuple sampling method as discussed on
> -hackers and -performance a few weeks ago.

Applied with minor editorializations. AFAICS get_next_S() needs to be
called with the number of tuples already processed, which means you were
off-by-one --- this surely makes only a trivial difference in the
probabilities, but if we are going to use Vitter's algorithm then we may
as well get it right. Also, I took out the TupleCount typedef and went
back to using doubles for the tuple counts; this is more consistent with
the coding style used elsewhere, and I really doubt that it's any
slower. (The datatype conversions induced inside get_next_S are likely
to outweigh any savings from counting by ints, on most modern hardware.)
Plus the justification for assuming it couldn't overflow seems weak to
me; the current limitation to 300000 requested sample rows is very
arbitrary and could change anytime.
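
For readers following the off-by-one point, here is a minimal sketch of reservoir sampling driven by a skip function in the style of Vitter's Algorithm X. It is not the applied patch: the names `next_skip` and `reservoir_sample` are hypothetical, and the code assumes an in-memory sequence of records rather than heap tuples. The detail under discussion is that `t` passed to the skip function is the number of records already processed, not the index of the record about to be examined. Counts are kept as floats inside `next_skip` only to mirror the remark about doubles.

```python
import random


def next_skip(t, n, rng):
    """Return how many records to skip before the next replacement.

    t   -- number of records ALREADY processed (must be >= n)
    n   -- reservoir size (must be > 0)
    rng -- a random.Random instance
    """
    v = rng.random()          # uniform in [0, 1)
    skipped = 0
    t = float(t) + 1.0        # the record about to be considered is number t + 1
    quot = (t - n) / t        # probability that this record is NOT chosen
    while quot > v:
        skipped += 1
        t += 1.0
        quot *= (t - n) / t   # probability that this one is not chosen either
    return skipped


def reservoir_sample(records, n, rng=None):
    """Return a uniform random sample of up to n items from a sequence."""
    if rng is None:
        rng = random.Random()
    if n <= 0:
        return []
    reservoir = list(records[:n])   # the first n records fill the reservoir
    t = len(reservoir)              # records processed so far
    if t < n:
        return reservoir            # fewer than n records: keep them all
    while True:
        t += next_skip(t, n, rng)   # jump over the records that lose the draw
        if t >= len(records):
            break
        reservoir[rng.randrange(n)] = records[t]  # replace a random slot
        t += 1                      # the chosen record has now been processed
    return reservoir
```

Calling `reservoir_sample(list(range(1000)), 5, random.Random(1))` returns five distinct values from the input. Passing `t + 1` instead of `t` to `next_skip` would still produce a sample, but with slightly skewed selection probabilities, which is the kind of trivial-but-wrong difference described above.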

I was initially convinced that your implementation of Knuth's algorithm
S was all wet, so now there's a bunch of comments explaining why it's
actually correct...
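
As a point of reference, Knuth's Algorithm S (selection sampling) can be sketched as follows. This is an illustrative version under stated assumptions, not the code from the patch: the function name `selection_sample` is hypothetical, and it picks indices from a range of known size rather than blocks of a table. Each candidate is accepted with probability (items still needed) / (candidates still remaining), which yields exactly `k` items in ascending order whenever `k <= total`.

```python
import random


def selection_sample(total, k, rng=None):
    """Choose k distinct indices from range(total), in ascending order."""
    if rng is None:
        rng = random.Random()
    chosen = []
    for t in range(total):
        needed = k - len(chosen)     # items still to be selected
        if needed <= 0:
            break                    # sample is complete
        remaining = total - t        # candidates not yet examined
        # Accept index t with probability needed / remaining. When
        # needed == remaining the test always succeeds, so the sample
        # cannot come up short.
        if rng.random() * remaining < needed:
            chosen.append(t)
    return chosen
```

For example, `selection_sample(100, 10, random.Random(7))` returns ten increasing indices between 0 and 99.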

regards, tom lane

In response to

Tuple sampling at 2004-05-22 22:59:49 from Manfred Koizar

Responses

Re: Tuple sampling at 2004-05-24 03:57:43 from Bruno Wolff III
Re: Tuple sampling at 2004-05-24 10:29:27 from Manfred Koizar
