Re: Gsoc2012 idea, tablesample

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Qi Huang <huangqiyx(at)hotmail(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, josh(at)agliodbs(dot)com, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, ants(at)cybertec(dot)at, heikki(dot)linnakangas(at)enterprisedb(dot)com, cbbrowne(at)gmail(dot)com, neil(dot)conway(at)gmail(dot)com, daniel(at)heroku(dot)com, pgsql-hackers(at)postgresql(dot)org, sfrost(at)snowman(dot)net
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-05-10 15:33:26
Message-ID: CA+TgmoY1VLK=xJk7Pr++6Saw=OHpx-+ooUpEum6zjoPtzJDi2w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, May 10, 2012 at 10:28 AM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>> One problem I see with this approach is that its efficiency
>> depends on the average tuple length, at least with a naive
>> approach to random ctid generator. The simplest way to generate
>> those randomly without introducing bias is to generate a random
>> page index between 0 and the relation's size in pages,
>
> Right.
>
>> and then generate random tuple index between 0 and
>> MaxHeapTuplesPerPage, which is 291 on x86-64 assuming the standard
>> page size of 8k.
>
> I think we can do better than that without moving too far from "the
> simplest way".  It seems like we should be able to get a more
> accurate calculation of a minimum base tuple size based on the table
> definition, and calculate the maximum number of those which could
> fit on a page.  On the other hand, ctid uses a line pointer, doesn't
> it?  Do we need to worry about dead line pointers allowing higher
> tuple indexes than the calculated maximum number of tuples per page?

I wonder if you could do this with something akin to the Bitmap Heap
Scan machinery. Populate a TID bitmap with a bunch of randomly chosen
TIDs, fetch them all in physical order, and if you don't get as many
rows as you need, rinse and repeat until you do.

I'm worried this project is getting so complicated that it will be
beyond the ability of a new hacker to get anything useful done. Can
we simplify the requirements here to something that is reasonable for
a beginner?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2012-05-10 15:34:34 Re: Draft release notes complete
Previous Message Bruce Momjian 2012-05-10 15:32:18 Re: Draft release notes complete