Quick Links

Re: Gsoc2012 idea, tablesample

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Florian Pflug" <fgp(at)phlo(dot)org>
Cc:	<josh(at)agliodbs(dot)com>,<andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <ants(at)cybertec(dot)at>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <cbbrowne(at)gmail(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <robertmhaas(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, <huangqiyx(at)hotmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, <sfrost(at)snowman(dot)net>
Subject:	Re: Gsoc2012 idea, tablesample
Date:	2012-05-11 14:03:13
Message-ID:	4FACD5D10200002500047B3F@gw.wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Florian Pflug <fgp(at)phlo(dot)org> wrote:

> Maybe one can get rid of these sorts of problems by factoring in
> the expected density of the table beforehand and simply accepting
> that the results will be inaccurate if the statistics are
> outdated?
>
> One could, for example, simply pick
>
> N := SamplingPercentage * MaxTuplesPerPage /
> AvgLiveTuplesPerPage
>
> where
>
> AvgLiveTuplesPerPage := #Tuples / #Pages
>
> random TIDs, fetch the live ones, and return them.

To clarify, I read this as using reltuples and relpages for the
table, and returning only tuples which are visible according to the
query's snapshot. (i.e., I think you used "live" to mean two
different things there.)

Unless I'm missing something, I think that works for percentage
selection, which is what the standard talks about, without any need
to iterate through addition samples. Good idea! We don't need to
do any second pass to pare down initial results, either. This
greatly simplifies coding while providing exactly what the standard
requires.

> I'm not totally sure whether this approach is sensible to
> non-uniformity in the tuple to line-pointer assignment, though.

I think we can solve that by going high enough with tuple numbers to
reach the highest tuple ID that might be in use in the table, and
*not* following HOT chains. (If we follow HOT chains, we could have
several distinct ctid values which returned the same tuple.) Or am
I thinking about this incorrectly?

> [more complex alternatives]

I really think your first suggestion covers it perfectly; these more
complex techniques don't seem necessary to me.

-Kevin

In response to

Re: Gsoc2012 idea, tablesample at 2012-05-11 11:17:48 from Florian Pflug

Responses

Re: Gsoc2012 idea, tablesample at 2012-05-11 14:27:44 from Tom Lane
Re: Gsoc2012 idea, tablesample at 2012-05-11 15:55:24 from Florian Pflug

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2012-05-11 14:13:17	Re: Gsoc2012 idea, tablesample
Previous Message	Bruce Momjian	2012-05-11 14:01:32	Re: Draft release notes complete