Re: Gsoc2012 idea, tablesample

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Florian Pflug" <fgp(at)phlo(dot)org>
Cc: <josh(at)agliodbs(dot)com>,<andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <ants(at)cybertec(dot)at>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <cbbrowne(at)gmail(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <robertmhaas(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, <huangqiyx(at)hotmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, <sfrost(at)snowman(dot)net>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-05-11 14:03:13
Message-ID: 4FACD5D10200002500047B3F@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Florian Pflug <fgp(at)phlo(dot)org> wrote:

> Maybe one can get rid of these sorts of problems by factoring in
> the expected density of the table beforehand and simply accepting
> that the results will be inaccurate if the statistics are
> outdated?
>
> One could, for example, simply pick
>
> N := SamplingPercentage * MaxTuplesPerPage /
> AvgLiveTuplesPerPage
>
> where
>
> AvgLiveTuplesPerPage := #Tuples / #Pages
>
> random TIDs, fetch the live ones, and return them.

To clarify, I read this as using reltuples and relpages for the
table, and returning only tuples which are visible according to the
query's snapshot. (i.e., I think you used "live" to mean two
different things there.)

Unless I'm missing something, I think that works for percentage
selection, which is what the standard talks about, without any need
to iterate through addition samples. Good idea! We don't need to
do any second pass to pare down initial results, either. This
greatly simplifies coding while providing exactly what the standard
requires.

> I'm not totally sure whether this approach is sensible to
> non-uniformity in the tuple to line-pointer assignment, though.

I think we can solve that by going high enough with tuple numbers to
reach the highest tuple ID that might be in use in the table, and
*not* following HOT chains. (If we follow HOT chains, we could have
several distinct ctid values which returned the same tuple.) Or am
I thinking about this incorrectly?

> [more complex alternatives]

I really think your first suggestion covers it perfectly; these more
complex techniques don't seem necessary to me.

-Kevin

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2012-05-11 14:13:17 Re: Gsoc2012 idea, tablesample
Previous Message Bruce Momjian 2012-05-11 14:01:32 Re: Draft release notes complete