Re: [HACKERS] Hint Bits and Write I/O

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, List pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [HACKERS] Hint Bits and Write I/O
Date: 2008-06-18 13:53:29
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Tue, 2008-05-27 at 19:32 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > My proposal is to have this as a two-stage process. When we set the hint
> > on a tuple in a clean buffer we mark it BM_DIRTY_HINTONLY, if not
> > already dirty. If we set a hint on a buffer that is BM_DIRTY_HINTONLY
> > then we mark it BM_DIRTY.
> I wonder if it is worth actually counting the number of newly set hint
> bits, rather than just having a counter that saturates at two. We could
> steal a byte from usage_count without making the buffer headers bigger.
> > If the bgwriter has time, it will write out BM_DIRTY_HINTONLY buffers,
> > though on a consistently busy server this should not occur.
> What do you mean by "if it has time"? How would it know that?
> > This won't change the behaviour of first-read-after-copy. To improve
> > that behaviour, I suggest that we only move from BM_DIRTY_HINTONLY to
> > BM_DIRTY when we are setting the hint for a new xid. If we are just
> > setting the same xid over-and-over again then we should avoid setting
> > the page dirty. So when data has been loaded via COPY, we will just
> > check the status of the xid once, then scan the whole page using the
> > single-item transaction cache.
> This doesn't make any sense to me. What is a "new xid"? And what is
> "setting the same xid over and over"? If a page is full of occurrences
> of the same xid, that doesn't really mean that it's less useful to
> correctly hint each occurrence.
> The whole proposal seems a bit overly complicated. What we talked about
> at PGCon was simply not setting the dirtybit when setting a hint bit.
> There's a certain amount of self-optimization there: if a page
> continually receives hint bit updates, that also means it is getting
> pinned and hence its usage_count stays high, thus it will tend to stay
> in shared buffers until something happens to make it really dirty.
> (Although that argument might not hold water for a bulk seqscan: you'll
> have hinted all the tuples and then very possibly throw the page away
> immediately. So counting the hints and eventually deciding we did
> enough to justify dirtying the page might be worth doing.)

WIP patch, for discussion and performance evaluation.

This patch changes the way that buffer hints are managed. We can split
the patch conceptually into two parts: how we set buffer hints and what
happens to buffers for which hints have been set.

With this patch, when we set a hint on a buffer we don't dirty the
buffer, though we keep track of hint_count for each buffer. Note that
there is no additional block state, no BM_SET_HINT_BITS etc..

When running a VACUUM command we always dirty the block when setting
hint bits, for a number of reasons:
* VACUUM FULL expects all hint bits to be set prior to moving tuples
* Setting all hint bits allows us to truncate the clog
* it forces the VACUUM to write out its own dirty buffers, which is OK,
since it is a background process.

Other commands call HeapTupleSatisfiesVacuum(), yet these tasks can be
more flexible with hint bit setting. These include ANALYZE, CREATE
INDEX, CLUSTER, HOT pruning and index scan marking deleted tuples (with
changes in all index AMs). This means we have to differentiate between
VACUUM and other callers of HeapTupleSatisfiesVacuum().

So the patch changes the APIs of HeapTupleSatisfiesVacuum(),
SetBufferCommitInfoNeedsSave() and SetHintBits() with changes to 13 AM
and command files. There are many changes in tqual.c, which seems the
right way because SetHintBits() is inlined. These make the patch fairly
large, though most of it is simple changes.

Hints are set on buffers in various ways, most of these ways relate to
tuples, though there are other hints which touch the block itself. This
patch takes the simply assumes that all hints are equal. We might later
choose, for example, that index LP_DEAD hints are worth more than heap
visibility hints. We might also argue that CLUSTER hints should *never*
set the page dirty, since those will be wasted writes if the CLUSTER's
transaction commits. We don't yet do anything like that in buffer
usage_count, so no need to get that complex yet in this case.

So what happens to buffers that have hint_count > 0? During the
bgwriter's normal LRU scan we write out non-dirty buffers when the
number of buffer hints is greater than or equal to a new GUC parameter
called bgwriter_hints_force_write. The default and minimum value for
this parameter is 1, so very similar to existing behaviour. Expected
settings would be 2-5, possibly as high as 20, though those are just
educated guesses. So the maximum is set arbitrarily as 100.

During a checkpoint we write out only fully dirty buffers, which are the
only ones required for correctness. The bgwriter does interrupt itself
to re-commence the LRU scan, so the bgwriter may still write hinted
blocks during the time a checkpoint is in progress. Doing it this way
ensures we don't confuse the two separate goals of correctness and
performance, which currently contributes to the peak write workload
during checkpoint.

When buffer replacement takes place in a backend, we only write out a
victim buffer if it is dirty. We *never* write out non-dirty victim
buffers, whatever their hint count. This speeds up the path for backends
if/when the bgwriter is ineffective and is a change from the current
behaviour. It also avoids adding to the most frequent code path when
blocks aren't hinted at all.

Temp buffers are never dirtied by hint bit setting. Most temp tables are
written in a single command, so that re-accessing clog for temp tuples
is seldom costly. This also changes current behaviour.

We might imagine a slightly smarter bgwriter algorithm or an auto-tuning
mechanism, but this patch is intended to allow the basic principles to
be tested and proven before we worry about such niceties. I've got a few
ideas, just as you probably have by now while reading this.

We set hint_count = 0 only when buffer allocated or assigned from
freelist, minimising additional management overhead. hint_count is
ignored once the buffer has been dirtied.

All of this should have the effect that a table scan on a newly loaded
table will avoid re-writing the table. The scan sets the hints but
doesn't dirty the block, so when it loops around to reuse buffers it
will not need to write them out. The bgwriter may clean some of the
blocks, if it has time, but that's it.

There is one minor strangeness in the patch, which is the change of
initdb's command order when "vacuuming database template1". With the
previous ordering of ANALYZE; VACUUM FULL; VACUUM; the flexible hint bit
setting of the ANALYZE on a freshly bootstrapped database caused a
*consistent* error during the VACUUM FULL which follows it. That took,
(cough, splutter), a little while to resolve. I've added that as a test
to the vacuum regression tests and not found another error (yet?). An
interesting mystery though. :-)

Some instrumentation can be added - I'll add what people think is useful
for attempting to prove this is effective.

I am expecting the patch to reduce I/O on both large and small
databases, though possibly at the expense of scalability. Let's see.

Comments and performance test results, please. I expect to continue
working on the patch for a while yet.

backend/access/gist/gistget.c | 4
backend/access/hash/hash.c | 6 !
backend/access/heap/heapam.c | 6 !
backend/access/heap/pruneheap.c | 10 !!
backend/access/index/indexam.c | 7 !
backend/access/nbtree/nbtinsert.c | 6 !
backend/access/nbtree/nbtree.c | 2
backend/access/nbtree/nbtutils.c | 6 !
backend/catalog/index.c | 3
backend/commands/analyze.c | 9 !
backend/commands/cluster.c | 6 !
backend/commands/vacuum.c | 10 !!
backend/commands/vacuumlazy.c | 6 !
backend/storage/buffer/buf_init.c | 1
backend/storage/buffer/bufmgr.c | 69 +!!!!!!!!!!!!!
backend/storage/buffer/localbuf.c | 1
backend/utils/misc/guc.c | 9 +
backend/utils/misc/postgresql.conf.sample | 2
backend/utils/time/tqual.c | 140 !!!!!!!!!!!!....
bin/initdb/initdb.c | 2
include/storage/buf_internals.h | 3
include/storage/bufmgr.h | 3
include/utils/tqual.h | 4
test/regress/expected/vacuum.out | 2
test/regress/sql/vacuum.sql | 2
25 files changed, 39 insertions(+), 280 modifications(!)

Simon Riggs
PostgreSQL Training, Services and Support

Attachment Content-Type Size
flexible_hint_writes.v4.patch text/x-patch 58.0 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2008-06-18 14:07:33 Re: Reducing overhead for repeat de-TOASTing
Previous Message Tom Lane 2008-06-18 13:45:35 Re: Reducing overhead for repeat de-TOASTing

Browse pgsql-patches by date

  From Date Subject
Next Message Alvaro Herrera 2008-06-18 14:32:12 Re: [HACKERS] Hint Bits and Write I/O
Previous Message Tom Lane 2008-06-17 18:36:18 Re: GIN improvements