Re: Measuring relation free space

From: Noah Misch <noah(at)leadboat(dot)com>
To: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Bernd Helmle <mailings(at)oopsware(dot)de>, Greg Smith <greg(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Measuring relation free space
Date: 2012-02-22 05:27:47
Message-ID: 20120222052747.GE8592@tornado.leadboat.com
Lists: pgsql-hackers

On Tue, Feb 14, 2012 at 02:04:26AM -0500, Jaime Casanova wrote:
> On Wed, Jan 25, 2012 at 9:47 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> >
> > With all that done, run some quick benchmarks: see how "SELECT free_percent
> > FROM pgstattuple(rel)" fares compared to "SELECT relation_free_space(rel)" for
> > a large heap and for a large B-tree index. If the timing difference is too
> > small to be interesting to you, remove relation_free_space() and submit your
> > pgstattuple() improvements alone. Otherwise, submit as written.
> >
>
> Ok. I split this into three patches.
>
> 1) pgstattuple-gin_spgist.patch
> This first patch adds gin and spgist support to pgstattuple, also
> makes pgstattuple use a ring buffer when reading tables or indexes.

The buffer access strategy usage bits look fine to commit. The GIN and SP-GiST
support has problems, detailed below.
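
For reference, the strategy bits amount to the pattern below (a minimal
sketch of the idea, not the patch's exact code; the function name is mine):

#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

static void
scan_with_ring_buffer(Relation rel)
{
	/* small ring of buffers, so the scan cannot evict all of shared_buffers */
	BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKREAD);
	BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
	BlockNumber blkno;

	for (blkno = 0; blkno < nblocks; blkno++)
	{
		Buffer		buf;

		CHECK_FOR_INTERRUPTS();
		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
								 RBM_NORMAL, strategy);
		/* ... inspect the page under an appropriate lock ... */
		ReleaseBuffer(buf);
	}
	FreeAccessStrategy(strategy);
}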

> 2) pgstattuple-relation_free_space.patch
> This patch adds the relation_free_space function to pgstattuple.
>
> the function relation_free_space() is faster than pgstattuple(). To
> test that, I initialized pgbench with a scale of 40.
> In that context pgstattuple() took 1.4s to process the pgbench_accounts
> table and relation_free_space() took 730ms (half the time!).
> For the index the difference is less pronounced: 170ms for the former and
> 150ms for the latter.

Benchmarks lasting on the order of one second are far too short. I tried the
first two patches on this 6914 MiB table and 4284 MiB index:

create table t(n) as select * from generate_series(1,200000000);
create index on t(n);

This machine has about 1 GiB of memory available for disk cache, and I used
shared_buffers = 128MB. I used a regular assert-enabled build with
debug_assertions = off. Timings:

pgstattuple.free_percent, heap: runtime 166.2s; answer 0.34
pgstattuple.free_percent, index: runtime 110.9s; answer 9.83
relation_free_space, heap: runtime 165.1s; answer 0.00838721
relation_free_space, index: runtime 108.7s; answer 2.23692

Note the disagreement in answers and the nonsensical answer from the last
test. The numbers do line up for smaller tables and indexes that I tried.

During the pgstattuple() runs on the heap, CPU usage divided evenly between
user and iowait time. With relation_free_space(), iowait kept above 80%. For
the index, pgstattuple() managed 60-80% iowait and relation_free_space() again
kept above 80%. Even so, that did not produce any significant change in
runtime. I'm guessing that readahead was highly effective here, so the I/O
bound dictated elapsed time.

Bottom line, this patch can probably achieve 50% speedups on already-in-memory
relations. It can reduce the contribution to CPU load, but not the elapsed
runtime, for relations we largely pull from disk. Do those benefits justify
the additional user-visible interface? I suppose the sort of installation
that would benefit most is one just short of the tipping point of the database
size exceeding memory size. Larger databases could not afford either
function, and smaller databases don't need to watch bloat so closely.
Offhand, I think that the I/O savings of sampling will be the real win, and
it's not worth an extra user-visible function to get the CPU usage savings
this patch offers. Other opinions welcome.

> 3) pgstattuple-stats_target.patch
> This patch adds a stats_target parameter to the relation_free_space()
> function. It mimics the way ANALYZE chooses the blocks to read and is
> faster than plain relation_free_space(), but of course it can be inexact
> if the pages we don't read are the ones with the most free space.

This part is a fresh submission. It is simple enough that I have reviewed it.
It gives the expected speedup. However, the results are wrong:

3 runs of relation_free_space('t', 5): 0.000171412, 0.000171876, 0.000169326
3 runs of relation_free_space('t', 10): 0.000336525, 0.000344404, 0.000341314

I can get an apparent infinite loop:

create table t0(n) as select * from generate_series(1,4000000);
create index on t0(n);

[local] test=# select relation_free_space('t0_n_idx', 100);
relation_free_space
---------------------
0.0982675

Time: 133.788 ms

[local] test=# select relation_free_space('t0_n_idx', 5);
Cancel request sent
ERROR: canceling statement due to user request
Time: 24655.481 ms

As a way forward, I suggest abandoning relation_free_space() entirely and
adding a sampling capability to pgstattuple(). There are clear gains to be
had from that method. The gains of splitting out the free percent calculation
from the other pgstattuple() calculations seem meager.

If you would like, submit the buffer strategy bits as a separate patch with
its own CF entry, noting that it arose from here. That part can be Ready for
Committer. I'm marking the original CF entry Returned with Feedback.

Patch 1:

> *** a/contrib/pgstattuple/pgstatindex.c
> --- b/contrib/pgstattuple/pgstatindex.c

> + /*
> + * Buffer access strategy for reading relations, it's simpler to keep it
> + * global because pgstat_*_page() functions read one buffer at a time.
> + * pgstat_heap() and pgstat_index() should initialize it before use.
> + */
> + BufferAccessStrategy bstrategy;

This variable should be static.
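
That is:

/* used only within this file, so give it internal linkage */
static BufferAccessStrategy bstrategy;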

> *** 450,455 ****
> --- 471,538 ----
> }
>
> /*
> + * pgstat_gin_page -- check tuples in a gin page
> + */
> + static void
> + pgstat_gin_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno)
> + {
> + Buffer buf;
> + Page page;
> +
> + buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
> + LockBuffer(buf, GIN_SHARE);
> + page = BufferGetPage(buf);
> +
> + if (GinPageIsDeleted(page))
> + {
> + /* recyclable page */
> + stat->free_space += BLCKSZ;
> + }
> + else if (GinPageIsLeaf(page) || GinPageIsData(page))
> + {
> + pgstat_index_page(stat, page, FirstOffsetNumber,
> + PageGetMaxOffsetNumber(page));
> + }
> + else
> + {
> + /* root or node */
> + }
> +
> + UnlockReleaseBuffer(buf);
> + }

Would you discuss, preferably in comments, the various page types found in GIN
indexes and why this code is correct for all of them? At a minimum, the
comment "root or node" is incorrect.

Another thing to consider and document is how you wish to count tuples. Your
implementation counts every key as a tuple. I think it would be more useful
to count every posting tree entry as a tuple, but I haven't given it a great
deal of thought.

> +
> + /*
> + * pgstat_spgist_page -- check tuples in a spgist page
> + */
> + static void
> + pgstat_spgist_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno)
> + {
> + Buffer buf;
> + Page page;
> +
> + buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
> + LockBuffer(buf, BUFFER_LOCK_SHARE);
> + page = BufferGetPage(buf);
> +
> + if (SpGistPageIsDeleted(page))
> + {
> + /* recyclable page */
> + stat->free_space += BLCKSZ;
> + }
> + else if (SpGistPageIsLeaf(page))
> + {
> + pgstat_index_page(stat, page, FirstOffsetNumber,
> + PageGetMaxOffsetNumber(page));
> + }
> + else
> + {
> + /* root or node */
> + }
> +
> + UnlockReleaseBuffer(buf);
> + }

pgstat_index_page will not do the right thing for SP-GiST indexes. Only
SPGIST_LIVE tuples should count as live; the other tupstates should count as
dead. Also, pgstattuple gives me a tuple count of 119288 for an SP-GiST index
of a table containing only 40000 tuples. (Disclaimer: I took my first look at
the SP-GiST code today.)
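
A rough sketch of the leaf-page accounting I have in mind, going by
spgist_private.h (untested, and the function name is mine):

#include "access/spgist_private.h"

static void
pgstat_spgist_leaf_page(pgstattuple_type *stat, Page page)
{
	OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
	OffsetNumber off;

	for (off = FirstOffsetNumber; off <= maxoff; off++)
	{
		ItemId		id = PageGetItemId(page, off);
		SpGistLeafTuple lt = (SpGistLeafTuple) PageGetItem(page, id);

		if (lt->tupstate == SPGIST_LIVE)
		{
			stat->tuple_count++;
			stat->tuple_len += ItemIdGetLength(id);
		}
		else
		{
			/* REDIRECT, DEAD, and PLACEHOLDER tuples are all dead space */
			stat->dead_tuple_count++;
			stat->dead_tuple_len += ItemIdGetLength(id);
		}
	}
	stat->free_space += PageGetFreeSpace(page);
}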

Patch 3:

> *** a/contrib/pgstattuple/pgstattuple.c
> --- b/contrib/pgstattuple/pgstattuple.c
> *************** GetHeapRelationFreeSpace(Relation rel)
> *** 691,716 ****
> {
> /* Get the current relation length */
> LockRelationForExtension(rel, ExclusiveLock);
> ! nblocks = RelationGetNumberOfBlocks(rel);
> UnlockRelationForExtension(rel, ExclusiveLock);
>
> /* Quit if we've scanned the whole relation */
> if (blkno >= nblocks)
> {
> break;
> }
>
> ! for (; blkno < nblocks; blkno++)
> {
> CHECK_FOR_INTERRUPTS();
>
> buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
> LockBuffer(buffer, BUFFER_LOCK_SHARE);
>
> page = BufferGetPage(buffer);
> free_space += PageGetHeapFreeSpace(page);
>
> UnlockReleaseBuffer(buffer);
> }
> }
>
> --- 700,733 ----
> {
> /* Get the current relation length */
> LockRelationForExtension(rel, ExclusiveLock);
> ! totalBlocksInRelation = RelationGetNumberOfBlocks(rel);
> UnlockRelationForExtension(rel, ExclusiveLock);
>
> + nblocks = totalBlocksInRelation * stats_target / 100;

You have defined stats_target as a percentage of the relation to sample.
That's a poor basis for sample size. Use a constant number of pages, just as
a given default_statistics_target leads ANALYZE to take a constant number of
tuples from each table. Further background:
http://en.wikipedia.org/wiki/Sample_size_determination#Estimating_proportions_and_means
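
In rough shape, something like this (the 300-pages-per-target-unit factor is
just an illustration borrowed from ANALYZE's 300 rows per statistics-target
unit, not a recommendation I've tested):

/* give large relations a constant sample, small ones a full scan */
static BlockNumber
sample_size(BlockNumber nblocks, int stats_target)
{
	BlockNumber target_pages = (BlockNumber) stats_target * 300;

	return Min(target_pages, nblocks);
}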

Excepting those few copied from analyze.c, this patch adds zero comments. You
even omit comments present in the corresponding analyze.c code.

nm
