Re: COUNT(*) and index-only scans

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Greg Stark <stark(at)mit(dot)edu>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COUNT(*) and index-only scans
Date: 2011-10-12 13:52:09
Message-ID: 12693.1318427529@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Oct 12, 2011 at 2:50 AM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>> On Tue, 2011-10-11 at 13:22 -0400, Robert Haas wrote:
>>> The real issue is that the costing estimates need to be accurate, and
>>> that's where the rubber hits the road.

>> Can you send stats messages to keep track when you unset a bit in the
>> VM? That might allow it to be more up-to-date.

> In theory, that seems like it would work, although I'm a little
> worried about the overhead.

I think it's overkill, and possibly unpleasantly unstable as well.
For the initial attack on this we should just have VACUUM and ANALYZE
count the number of all-visible blocks and store that in pg_class along
with the tuple-count statistics. There's no reason to think that this
will be any worse than the way we deal with dead tuple counts, for
instance.

What bothers me considerably more is the issue about how specific
queries might see an all-visible fraction that's very substantially
different from the table's overall ratio, especially in examples such as
historical-data tables where most of the update and query activity has
to do with recently-added rows. I don't see any practical way to attack
that problem with statistics; we're just going to have to adopt some
estimation rule. What I suggest as a first cut for that is: simply
derate the visibility fraction as the fraction of the table expected to
be scanned gets smaller. That is, if the query fetches nearly all of
the table, take the stored visibility ratio at face value; if it fetches
only one block, never believe that that will be an all-visible block;
and in general if we're expecting to read a fraction f of the pages,
multiply the whole-table visibility ratio by f before using it in the
cost estimate. This amounts to assuming that the historical-data case
is the usual case, but I'm not sure that's unfair.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2011-10-12 14:12:39 Re: [BUGS] *.sql contrib files contain unresolvable MODULE_PATHNAME
Previous Message Robert Haas 2011-10-12 13:34:51 Re: [BUGS] *.sql contrib files contain unresolvable MODULE_PATHNAME