Re: Gsoc2012 idea, tablesample

From: Sandro Santilli <strk(at)keybit(dot)net>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: Qi Huang <huangqiyx(at)hotmail(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, neil(dot)conway(at)gmail(dot)com, daniel(at)heroku(dot)com, cbbrowne(at)gmail(dot)com, kevin(dot)grittner(at)wicourts(dot)gov
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-24 06:49:26
Message-ID: 20120424064926.GE7891@gnash
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote:
> On Mon, Apr 23, 2012 at 4:37 PM, Sandro Santilli <strk(at)keybit(dot)net> wrote:
> > I'd love to see enhanced CTID operators, to fetch all visible tuples in a page
> > using a tidscan.  Something like: WHERE ctid =~ '(501,*)' or a ctidrange.
>
> Among other things, this would enable user-space implementation of
> tablesample. Given the operator =~(tid, int) that matches the page
> number and planner/executor integration so that it results in a TID
> scan, you would need the following functions:
>
> random_pages(tbl regclass, samples int) returns int[]
> aggregate function:
> reservoir_sample(item anyelement, samples int) returns anyarray
>
> Implementations for both of the functions could be adapted from analyze.c.
>
> Then tablesample could be implemented with the following query:
> SELECT (SELECT reservoir_sample(some_table, 50) AS samples
> FROM some_table WHERE ctid =~ ANY (rnd_pgtids))
> FROM random_pages('some_table', 50) AS rnd_pgtids;
>
> Actually, now that I think about it, it could actually be implemented
> without any modifications to core at some cost to efficiency.
> random_pages would have to return tid[] that contains for each
> generated pagenumber all possible tids on that page.

This is exactly what I'm after.
I've actually started crafting such a TableSample function and I'm in the
process to refine the signature so your suggested interface above is
very useful, thanks !

But I don't understand the reservoir_sample call, what is it supposed to do ?
And how flexibly "anyarray" return would be ? Could you return arbitrary
typed rowtypes from it ?

> By making the building blocks available users get more flexibility.
> The downside would be that we can't automatically make better sampling
> methods available.

One approach doesn't preclude the other. TABLESAMPLE will still be useful,
also for SQL compliance.

--strk;

,------o-.
| __/ | Delivering high quality PostGIS 2.0 !
| / 2.0 | http://strk.keybit.net - http://vizzuality.com
`-o------'

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2012-04-24 06:53:04 Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments
Previous Message Kyotaro HORIGUCHI 2012-04-24 06:47:57 Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments