Re: Gsoc2012 idea, tablesample

From: Qi Huang <huangqiyx(at)hotmail(dot)com>
To: <ants(at)cybertec(dot)at>, <cbbrowne(at)gmail(dot)com>
Cc: <sfrost(at)snowman(dot)net>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-18 03:11:57
Message-ID: BAY159-W399366318A30988BCF81FDA33C0@phx.gbl
Lists: pgsql-hackers

> Date: Wed, 18 Apr 2012 02:45:09 +0300
> Subject: Re: [HACKERS] Gsoc2012 idea, tablesample
> From: ants(at)cybertec(dot)at
> To: cbbrowne(at)gmail(dot)com
> CC: sfrost(at)snowman(dot)net; pgsql-hackers(at)postgresql(dot)org
>
> On Tue, Apr 17, 2012 at 7:33 PM, Christopher Browne <cbbrowne(at)gmail(dot)com> wrote:
> > Well, there may be cases where the quality of the sample isn't
> > terribly important, it just needs to be "reasonable."
> >
> > I browsed an article on the SYSTEM/BERNOULLI representations; they
> > both amount to simple picks of tuples.
> >
> > - BERNOULLI implies picking tuples with a specified probability.
> >
> > - SYSTEM implies picking pages with a specified probability. (I think
> > we mess with this in ways that'll be fairly biased in view that tuples
> > mayn't be of uniform size, particularly if Slightly Smaller strings
> > stay in the main pages, whilst Slightly Larger strings get TOASTed...)

The definition of the BERNOULLI method implies scanning all the tuples, which raises a question I keep coming back to: how does using BERNOULLI differ from using "select * .... where random() < 0.1"? Both go through every tuple and cost a sequential scan.

If the answer is "no difference", I have a proposal for another way to do BERNOULLI. For a relation, assign every tuple a unique, contiguous ID (we may use ctid or something else). Then, for each number in the set of IDs, draw a random number and check whether it is smaller than the sampling percentage. If it is, retrieve the tuple corresponding to that ID. This method does not sequentially scan all the tuples, yet it still samples by picking individual tuples.
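
The comparison above can be sketched as a small simulation. This is only an illustration in Python, not PostgreSQL code: it assumes tuples are addressable by a dense integer ID, and the names `bernoulli_seqscan`, `bernoulli_by_id`, and the `fetch` callback are hypothetical. It counts how many tuples each approach actually reads.

```python
import random

def bernoulli_seqscan(table, p, rng):
    """Baseline: read every tuple, keep each one with probability p."""
    reads = 0
    sample = []
    for row in table:              # every tuple is fetched
        reads += 1
        if rng.random() < p:
            sample.append(row)
    return sample, reads

def bernoulli_by_id(n_tuples, fetch, p, rng):
    """Proposed variant: flip the coin per ID, fetch only the selected IDs.

    Assumes IDs 0..n_tuples-1 are dense and that fetch(id) returns the
    tuple with that ID (a hypothetical lookup, not a real PostgreSQL API).
    """
    reads = 0
    sample = []
    for tuple_id in range(n_tuples):   # iterates over IDs, not tuples
        if rng.random() < p:
            sample.append(fetch(tuple_id))
            reads += 1
    return sample, reads

table = [f"row-{i}" for i in range(10_000)]
p = 0.1

# Same seed for both, so both make identical keep/skip decisions.
full, full_reads = bernoulli_seqscan(table, p, random.Random(42))
by_id, id_reads = bernoulli_by_id(len(table), table.__getitem__, p, random.Random(42))

print(len(full), full_reads, id_reads)
```

Under these assumptions both functions return the same sample. The ID-based loop still performs one coin flip per ID, but it reads only the selected tuples, whereas the baseline reads all of them.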

Best Regards and Thanks
Huang Qi Victor
Computer Science of National University of Singapore
