Re: Gsoc2012 idea, tablesample

From: Florian Pflug <fgp(at)phlo(dot)org>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, <josh(at)agliodbs(dot)com>, <andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <ants(at)cybertec(dot)at>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <cbbrowne(at)gmail(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, "Qi Huang" <huangqiyx(at)hotmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, <sfrost(at)snowman(dot)net>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-05-11 00:38:43
Message-ID: 73B6F18C-C210-4BC2-BEA2-AE9FF50B71B3@phlo.org
Lists: pgsql-hackers

On May10, 2012, at 18:36 , Kevin Grittner wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> I wonder if you could do this with something akin to the Bitmap
>> Heap Scan machinery. Populate a TID bitmap with a bunch of
>> randomly chosen TIDs, fetch them all in physical order
>> and if you don't get as many rows as you need, rinse and repeat
>> until you do.
>
> Ay, there's the rub. If you get too many, it is important that you
> read all the way to the end and then randomly omit some of them.

Why is that? From a statistical point of view it shouldn't matter
whether you pick N random samples, or pick M >= N random samples and
then randomly pick N from M ("random" implying uniformly distributed
here).
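To illustrate the point (a toy Python sketch, not anything from the PostgreSQL sources; the function name is invented): uniformly subsampling N items from a uniform oversample of M items leaves every element of the population equally likely to appear.

```python
import random
from collections import Counter

def sample_two_stage(population, m, n):
    """Pick M >= N items uniformly at random, then uniformly
    pick N of those M. The result should be distributed the
    same as sampling N directly from the population."""
    assert m >= n
    oversample = random.sample(population, m)  # M uniform picks
    return random.sample(oversample, n)        # uniform N-of-M

# Sanity check: each element should appear with probability
# n / len(population), i.e. about trials * 3/10 times here.
population = list(range(10))
trials = 20000
counts = Counter()
for _ in range(trials):
    counts.update(sample_two_stage(population, m=6, n=3))
```

Running this, the per-element counts all cluster around trials * 3/10, as the argument predicts.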

> While a bit of a bother, that's pretty straightforward and should be
> pretty fast, assuming you're not, like, an order of magnitude high.
> But falling short is tougher; making up the difference could be an
> iterative process, which could always wind up with having you read
> all tuples in the table without filling your sample.

But the likelihood of that happening is extremely low, no? Unless the
sampling percentage is very high, that is, but that case isn't of much
practical importance anyway.
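For concreteness, the rinse-and-repeat scheme under discussion might look roughly like this toy Python sketch (not PostgreSQL code; `sample_until_full` and the dict-as-table model are invented for illustration, with None standing in for a dead tuple). It draws a batch of random TIDs, fetches them in physical order, tracks what it has already drawn, repeats if short, and uniformly discards any surplus at the end, per Kevin's point about not just truncating the scan.

```python
import random

def sample_until_full(table, n, batch=None):
    """table maps TID -> row, with None marking a dead tuple.
    Draw random TIDs, fetch in physical (sorted) order, and
    repeat with fresh TIDs until at least N live rows are found;
    then uniformly drop any excess."""
    batch = batch or 2 * n
    live, seen = [], set()
    while len(live) < n and len(seen) < len(table):
        remaining = [t for t in table if t not in seen]
        tids = random.sample(remaining, min(batch, len(remaining)))
        seen.update(tids)               # never refetch a TID
        for tid in sorted(tids):        # physical-order fetch
            if table[tid] is not None:
                live.append(table[tid])
    # over-shot: randomly omit some, rather than truncating
    return random.sample(live, n) if len(live) >= n else live

# toy table of 100 TIDs where every third tuple is dead
table = {tid: (None if tid % 3 == 0 else f"row{tid}")
         for tid in range(100)}
sample = sample_until_full(table, n=10)
```

The loop terminates even in the pathological case, because `seen` eventually covers the whole table; the question above is how likely it is to get anywhere near that point in practice.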

But something else comes to mind. Does the standard permit samples taken
with the BERNOULLI method to contain the same tuple multiple times? If
not, any kind of TID-based approach will have to track all previously
fetched TIDs, which seems doable but unfortunate...

best regards,
Florian Pflug
