Re: Gsoc2012 idea, tablesample

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: <josh(at)agliodbs(dot)com>,<andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <ants(at)cybertec(dot)at>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <cbbrowne(at)gmail(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, "Qi Huang" <huangqiyx(at)hotmail(dot)com>, "Florian Pflug" <fgp(at)phlo(dot)org>, <pgsql-hackers(at)postgresql(dot)org>, <sfrost(at)snowman(dot)net>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-05-10 16:36:45
Message-ID: 4FABA84D0200002500047AE5@gw.wicourts.gov
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> I wonder if you could do this with something akin to the Bitmap
> Heap Scan machinery. Populate a TID bitmap with a bunch of
> randomly chosen TIDs, fetch them all in physical order

It would be pretty hard for any other plan to beat that by very
much, so it seems like a good approach which helps keep things
simple.

> and if you don't get as many rows as you need, rinse and repeat
> until you do.

Ay, there's the rub. If you get too many, it is important that you
read all the way to the end and then randomly omit some of them.
While a bit of a bother, that's pretty straightforward and should be
pretty fast, assuming you're not, like, an order of magnitude high.
But falling short is tougher; making up the difference could be an
iterative process, which in the worst case winds up reading every
tuple in the table without ever filling the sample. Still, this
approach seems like it would perform better than generating random
ctid values and fetching them one by one until you've tried them all.
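
Roughly, the control flow I'm picturing is something like the sketch
below (plain Python standing in for the executor, purely to
illustrate; the heap layout, the sample_rows() name, and the 20%
oversampling factor are made up for the example, not anything in the
tree):

import random

# Toy model: the "heap" is a list of pages, each page a list of slots; a
# slot holds a tuple or None (dead/unused).  A "TID" is just (page, offset).
# None of this is PostgreSQL code -- it only illustrates the control flow.

def sample_rows(heap, target, oversample=1.2):
    npages = len(heap)
    max_offset = max(len(page) for page in heap)     # per-page upper bound
    untried = {(p, o) for p in range(npages) for o in range(max_offset)}
    results = []

    while len(results) < target and untried:
        # Pick somewhat more candidate TIDs than we still need, since some
        # will point at dead or nonexistent slots.
        need = target - len(results)
        k = min(len(untried), int(need * oversample) + 1)
        candidates = random.sample(sorted(untried), k)
        untried.difference_update(candidates)

        # Visit candidates in physical (page, offset) order, as a bitmap
        # heap scan would, keeping only the live tuples.
        for page, offset in sorted(candidates):
            if offset < len(heap[page]) and heap[page][offset] is not None:
                results.append(heap[page][offset])

    # Overshot?  Having read them all, randomly omit the excess.
    if len(results) > target:
        results = random.sample(results, target)
    return results

# Example: 10 pages of 50 slots, roughly 20% dead; ask for 100 rows.
heap = [[("row", p, o) if random.random() > 0.2 else None
         for o in range(50)] for p in range(10)]
print(len(sample_rows(heap, 100)))   # 100, or fewer if the table held
                                     # fewer than 100 live rows in total

The 20% padding is just a guess at how many candidate TIDs will turn
out to be dead or past the end of a page; presumably the real thing
would size that from the table's statistics rather than a constant.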

> I'm worried this project is getting so complicated that it will be
> beyond the ability of a new hacker to get anything useful done.
> Can we simplify the requirements here to something that is
> reasonable for a beginner?

I would be inclined to omit monetary unit sampling from the first
commit. Do the parts specified in the standard first and get it
committed. Useful as unit sampling is, it seems like the hardest to
do, and should probably be done "if time permits" or left as a
future enhancement. It's probably enough to just remember that it's
there and make a "best effort" attempt not to paint ourselves into a
corner that precludes its development.

-Kevin
