Re: Gsoc2012 idea, tablesample

From: Ants Aasma <ants(at)cybertec(dot)at>
To: Christopher Browne <cbbrowne(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-17 23:45:09
Message-ID: CA+CSw_tfz1=VYoY9jmb=Nt5SSP9b_hOdYsfj9RGG8BtxGkOX0Q@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 17, 2012 at 7:33 PM, Christopher Browne <cbbrowne(at)gmail(dot)com> wrote:
> Well, there may be cases where the quality of the sample isn't
> terribly important, it just needs to be "reasonable."
>
> I browsed an article on the SYSTEM/BERNOULLI representations; they
> both amount to simple picks of tuples.
>
> - BERNOULLI implies picking tuples with a specified probability.
>
> - SYSTEM implies picking pages with a specified probability.  (I think
> we mess with this in ways that'll be fairly biased in view that tuples
> mayn't be of uniform size, particularly if Slightly Smaller strings
> stay in the main pages, whilst Slightly Larger strings get TOASTed...)

Dealing with non-uniform sizes isn't too hard; analyze.c already does
that. Given a table with B blocks, it takes a uniform sample of b
blocks and runs Vitter's reservoir sampling algorithm over the
resulting blocks to get s random tuples in a single pass. It's quite
easy to prove that this gives each tuple an equal probability of
appearing in the final sample. However, it isn't fully independent
sampling: depending on how b compares to s and B, there is a slightly
elevated probability of seeing multiple tuples picked from the same
page. I'm too lazy to do the math, but for the analyze case of b = s
it probably isn't significant for most practical purposes, even if B
is really large. And it seems to me that when b >= s, the reservoir
sampling thresholds could be tweaked at block boundaries to even out
the dependencies.
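
To illustrate, here is a rough sketch of the two-stage scheme (plain
Algorithm R rather than the skip-based Algorithm Z that analyze.c
actually uses; sample_blocks(), tuples_in_block() and tuple_at() are
made-up placeholders for the real access method calls):

#include <stdlib.h>

typedef struct Tuple Tuple;

extern void   sample_blocks(int B, int b, int *out);  /* hypothetical */
extern int    tuples_in_block(int blkno);             /* hypothetical */
extern Tuple *tuple_at(int blkno, int i);             /* hypothetical */

/* Pick s random tuples from a table of B blocks, reading only b blocks. */
void
block_reservoir_sample(int B, int b, int s, Tuple **reservoir)
{
    int *blocks = malloc(b * sizeof(int));
    int  seen = 0;

    sample_blocks(B, b, blocks);    /* uniform sample of b block numbers */

    for (int k = 0; k < b; k++)
    {
        int ntup = tuples_in_block(blocks[k]);

        for (int i = 0; i < ntup; i++)
        {
            Tuple *tup = tuple_at(blocks[k], i);

            if (seen < s)
                reservoir[seen] = tup;      /* fill the reservoir first */
            else
            {
                /* keep the new tuple with probability s/(seen+1) */
                int j = rand() % (seen + 1);

                if (j < s)
                    reservoir[j] = tup;
            }
            seen++;
        }
    }
    free(blocks);
}

The reservoir step is what evens out the varying number of tuples per
block: every tuple that gets scanned has the same chance of surviving
to the end. The dependence mentioned above comes only from tuples in
the same sampled block being scanned together.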

The ratio of b to s could also be tuned to trade sample quality
(samples become more spatially clumped) for less random I/O.

> Possibly the forms of sampling that people *actually* need, most of
> the time, are more like Dollar Unit Sampling, which are pretty
> deterministic, in ways that mandate that they be rather expensive
> (e.g. - guaranteeing Seq Scan).

I have a gut feeling that Dollar Unit Sampling and other weighted
sampling schemes can be handled with a similar approach: uniformly
sample blocks, then run weighted reservoir sampling [1] over the
result.

[1]: Efraimidis and Spirakis, "Weighted Random Sampling",
http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf
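
For concreteness, a sketch of the A-Res scheme from [1]: each item
with weight w draws a key u^(1/w), u uniform in (0,1), and the s items
with the largest keys form the sample. For Dollar Unit Sampling the
weight would be the monetary value of the row. (A real implementation
would track the smallest key with a min-heap; the linear scan here
just keeps the sketch short. weighted_reservoir_add() and the Tuple
type are made up for illustration.)

#include <math.h>
#include <stdlib.h>

typedef struct Tuple Tuple;

typedef struct
{
    Tuple  *tup;
    double  key;
} WeightedSlot;

/* uniform draw in the open interval (0,1) */
static double
uniform01(void)
{
    return (rand() + 1.0) / ((double) RAND_MAX + 2.0);
}

/* Offer one tuple with the given weight to a reservoir of s slots. */
void
weighted_reservoir_add(WeightedSlot *res, int s, int *nfilled,
                       Tuple *tup, double weight)
{
    double key = pow(uniform01(), 1.0 / weight);

    if (*nfilled < s)
    {
        res[*nfilled].tup = tup;
        res[*nfilled].key = key;
        (*nfilled)++;
        return;
    }

    /* find the slot with the smallest key ... */
    int min = 0;
    for (int i = 1; i < s; i++)
        if (res[i].key < res[min].key)
            min = i;

    /* ... and evict it if the new key beats it */
    if (key > res[min].key)
    {
        res[min].tup = tup;
        res[min].key = key;
    }
}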

Cheers,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
