Re: Gsoc2012 idea, tablesample

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <josh(at)agliodbs(dot)com>,<andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <ants(at)cybertec(dot)at>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <cbbrowne(at)gmail(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <robertmhaas(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, <huangqiyx(at)hotmail(dot)com>, "Florian Pflug" <fgp(at)phlo(dot)org>, <pgsql-hackers(at)postgresql(dot)org>, <sfrost(at)snowman(dot)net>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-05-11 15:50:37
Message-ID: 4FACEEFD0200002500047BB3@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
>> [ uniformly sample the TID space defined as (1..P, 1..M) ]
>
>> Shouldn't that get us the randomly chosen sample we're looking
>> for? Is there a problem you think this ignores?
>
> Not sure. The issue that I'm wondering about is that the line
> number part of the space is not uniformly populated, ie, small
> line numbers are much more likely to exist than large ones. (In
> the limit that density goes to zero, when you pick M much too
> large.) It's not clear to me whether this gives an unbiased
> probability of picking real tuples, as opposed to hypothetical
> TIDs.

I'm convinced it generates a random sample, since every tuple in the
*possible* space has an equal chance of selection, and we ignore the
ones which don't exist or aren't visible, each of the remaining
(existing, visible) tuples is as likely to be chosen as any other.
I'm that's not a formal proof, but I'm sure I could work one up.

> Another issue is efficiency. In practical cases you'll have to
> greatly overestimate M compared to the typical actual-number-of-
> tuples-per-page, which will lead to a number of target TIDs N
> that's much larger than necessary, which will make the scan slow
> --- I think in practice you'll end up doing a seqscan or something
> that might as well be one, because unless S is *really* tiny it'll
> hit just about every page. We can have that today without months
> worth of development effort, using the "WHERE random() < S"
> technique.

I think you've probably got me there. The technique you describe
should be equally random, and thinking through the work involved in
each, the random() < S approach seems almost certain to be cheaper.
Is it worth supporting the TABLESAMPLE BERNOULLI option as syntactic
sugar for that?

-Kevin

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2012-05-11 15:53:57 Re: Can pg_trgm handle non-alphanumeric characters?
Previous Message Florian Pflug 2012-05-11 15:45:18 Re: Gsoc2012 idea, tablesample