Re: Gsoc2012 idea, tablesample

From: Ants Aasma <ants(at)cybertec(dot)at>
To: Qi Huang <huangqiyx(at)hotmail(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, neil(dot)conway(at)gmail(dot)com, daniel(at)heroku(dot)com, cbbrowne(at)gmail(dot)com, kevin(dot)grittner(at)wicourts(dot)gov
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-23 17:34:44
Message-ID: CA+CSw_twJkJ0P45oGNXRJ4RFfknEuuR4nVd=GqXUGw4gGm-HUA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Apr 23, 2012 at 4:37 PM, Sandro Santilli <strk(at)keybit(dot)net> wrote:
> I'd love to see enhanced CTID operators, to fetch all visible tuples in a page
> using a tidscan.  Something like: WHERE ctid =~ '(501,*)' or a ctidrange.

Among other things, this would enable user-space implementation of
tablesample. Given the operator =~(tid, int) that matches the page
number and planner/executor integration so that it results in a TID
scan, you would need the following functions:

random_pages(tbl regclass, samples int) returns int[]
aggregate function:
reservoir_sample(item anyelement, samples int) returns anyarray

Implementations for both of the functions could be adapted from analyze.c.

Then tablesample could be implemented with the following query:
SELECT (SELECT reservoir_sample(some_table, 50) AS samples
FROM some_table WHERE ctid =~ ANY (rnd_pgtids))
FROM random_pages('some_table', 50) AS rnd_pgtids;

Actually, now that I think about it, it could actually be implemented
without any modifications to core at some cost to efficiency.
random_pages would have to return tid[] that contains for each
generated pagenumber all possible tids on that page.

By making the building blocks available users get more flexibility.
The downside would be that we can't automatically make better sampling
methods available.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-04-24 02:03:06 Re: Patch: add timing of buffer I/O requests
Previous Message Noah Misch 2012-04-23 16:30:03 psql omits row count under "\x auto"