Re: Gsoc2012 idea, tablesample

From: Ants Aasma <ants(at)cybertec(dot)at>
To: Ants Aasma <ants(at)cybertec(dot)at>, Qi Huang <huangqiyx(at)hotmail(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, neil(dot)conway(at)gmail(dot)com, daniel(at)heroku(dot)com, cbbrowne(at)gmail(dot)com, kevin(dot)grittner(at)wicourts(dot)gov
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-24 09:00:55
Message-ID: CA+CSw_up_aWm1LOp4A4RWqPPb56hUuWS8+OMSu3xU78KwXChhQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Apr 24, 2012 at 10:31 AM, Sandro Santilli <strk(at)keybit(dot)net> wrote:
> On Tue, Apr 24, 2012 at 08:49:26AM +0200, Sandro Santilli wrote:
>> On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote:
>
>> > SELECT (SELECT reservoir_sample(some_table, 50) AS samples
>> >    FROM some_table WHERE ctid =~ ANY (rnd_pgtids))
>> > FROM random_pages('some_table', 50) AS rnd_pgtids;
>>
>> But I don't understand the reservoir_sample call, what is it supposed to do ?
>
> Ok got it, that was probably to avoid:
>
>  ERROR:  more than one row returned by a subquery used as an expression

No, it's to avoid bias towards tuples on more sparsely populated
pages. See http://en.wikipedia.org/wiki/Reservoir_sampling or
http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/analyze.c;h=ff271644e0f93ee99bfe9c1f536f3dd48455d8d2;hb=HEAD#l1027

> The advanced TID operator would be for random_tids to only return pages rather
> than full tids...

Exactly. But when mainly IO bound (ie. sampling from a large table on
spinning rust) the overhead of probing with TID scan as opposed to
sequentially scanning the pages should be small enough. When CPU bound
I suspect that the function call machinery overhead for
reservoir_sample is going to become a large issue, so a built in
tablesample also has an edge there.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Krzysztof Nienartowicz 2012-04-24 09:12:56 Re: [HACKERS] Namespace of array of user defined types is confused by the parser in insert?
Previous Message Michael Meskes 2012-04-24 08:17:14 Re: ECPG FETCH readahead