Re: Query Sampling

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Varun Kacholia <kacholia(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query Sampling
Date: 2005-08-30 09:14:01
Message-ID: 1125393241.4010.425.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 2005-08-29 at 14:58 -0700, Varun Kacholia wrote:
> > I assume you realise that Bernoulli sampling is currently possibly using
> > the random() function and setseed() ?
> Yes, select * from table where random() < x, does the job.
>
> > I can't see why TABLESAMPLE effects a sequential scan *only*, in all
> > cases. I agree that there seems little point in sampling rows from a
> > table when it is already sufficiently restricted that the query could
> > use an index.
> > AFAICS this clause would potentially effect Index and Bitmap scans also,
> > and would be required for full correctness to the standard.
>
> As I see it, there are 3 ways of implementing the sample operator:
> 1. modify node[Seq|Tid|Index|..]scan.c to consider sampling
> 2. create new nodes for each of the possible scans..sequential, index, tid et al
> 3. support sequential scan only for sampling.
>
> (1) does not seem to be attractive, while (2) is a lot of work to
> begin with. I was
> planning to start with (3) and approach to (2) in the long run.
> I would appreciate your opinion on this.

IMHO creating new nodes just for sampling would duplicate too much code.
To me, the Bernoulli sampling sounds like 2-3 carefully placed
statements in the executor nodes and a couple of additions to the node
data structures. (As well as logic in the parser).

Sounds like you would be better off prototyping something for sequential
scans. If you can get Bernoulli working, you can move on to get SYSTEM
working - which needs a deeper reach into the guts of the block request
logic.

We might be able to get away with the thought that SYSTEM sampling will
actually use BERNOULLI sampling when an index or bitmap scan is used.
That would give us standards compliant behaviour without too much effort
(since that effort is essentially wasted, as previously observed).

Best Regards, Simon Riggs

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Teodor Sigaev 2005-08-30 09:25:52 Re: VACUUM/t_ctid bug (was Re: GiST concurrency commited)
Previous Message Simon Riggs 2005-08-30 07:37:04 Re: Performance gain from reduction of GROUP BY memory