Re: Spread checkpoint sync

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Spread checkpoint sync
Date: 2011-02-07 07:07:41
Message-ID: 4D4F9A3D.5070700@2ndquadrant.com
Lists: pgsql-hackers

Robert Haas wrote:
> With the fsync queue compaction patch applied, I think most of this is
> now not needed. Attached please find an attempt to isolate the
> portion that looks like it might still be useful. The basic idea of
> what remains here is to make the background writer still do its normal
> stuff even when it's checkpointing. In particular, with this patch
> applied, PG will:
>
> 1. Absorb fsync requests a lot more often during the sync phase.
> 2. Still try to run the cleaning scan during the sync phase.
> 3. Pause for 3 seconds after every fsync.
>

Yes, the bits you extracted were the remaining useful parts from the
original patch. Today was quiet here because there were sports on or
something, and I added full auto-tuning magic to the attached update. I
need to kick off benchmarks and report back tomorrow to see how well
this does, but any additional patch here would only be code cleanup on
the messy stuff I did in here (plus proper implementation of the pair of
GUCs). This has finally gotten to the exact logic I've been meaning to
complete as spread sync since the idea was first postponed in 8.3, with
the benefit of some fsync absorption improvements along the way too.

The automatic timing is modeled on the existing
checkpoint_completion_target concept, except with a new tunable (not yet
added as a GUC) currently called CheckPointSyncTarget, set to 0.8 right
now. What I think I want to do is make the existing
checkpoint_completion_target now be the target for the end of the sync
phase, matching its name; people who bumped it up won't necessarily even
have to change anything. Then the new GUC can be
checkpoint_write_target, representing the write-phase target that the
existing setting controls right now.

I tossed the earlier idea of counting relations to sync based on the
write phase data as too inaccurate after testing, and with it for now
goes checkpoint sorting. Instead, I just take a first pass over
pendingOpsTable to get a total number of things to sync, which will
always match the real count barring strange circumstances (like dropping
a table).
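
The counting pass itself is nothing fancy; roughly this sort of thing (a
sketch rather than the exact patch code), using the usual dynahash scan
over md.c's pendingOpsTable:

/* Count how many pending fsync requests mdsync() will have to process.
 * The count can drift a little if e.g. a relation is dropped before its
 * sync actually happens. */
HASH_SEQ_STATUS hstat;
PendingOperationEntry *entry;
int         sync_goal = 0;

hash_seq_init(&hstat, pendingOpsTable);
while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
    sync_goal++;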

As for automatically determining the interval, I take the number of
syncs that have finished so far, divide by the total, and get a number
between 0.0 and 1.0 that represents progress on the sync phase. I then
use the same basic CheckpointWriteDelay logic that is there for
spreading writes out, except with the new sync target. I realized that
if we assume the checkpoint writes should have finished in
CheckPointCompletionTarget worth of time or segments, we can compute a
new progress metric with the formula:

progress = CheckPointCompletionTarget + (1.0 -
CheckPointCompletionTarget) * finished / goal;

Where "finished" is the number of segments written out, while "goal" is
the total. To turn this into an example, let's say the default
parameters are set, we've finished the writes, and finished 1 out of 4
syncs; that much work will be considered:

progress = 0.5 + (1.0 - 0.5) * 1 / 4 = 0.625

On a scale that effectively aims to have the sync work finished by 0.8.
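
In code that's just a couple of lines; here's a sketch with illustrative
variable names (the casts matter, since "finished" and "goal" are integer
counts):

/* Sync-phase progress: start from the write-phase target, then advance
 * toward 1.0 as each pending file gets synced. */
double      progress;

progress = CheckPointCompletionTarget +
    (1.0 - CheckPointCompletionTarget) * (double) finished / (double) goal;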

I don't use quite the same logic as CheckpointWriteDelay though. It
turns out the existing checkpoint_completion_target implementation
doesn't always work like I thought it did, which provides some very
interesting insight into why my attempts to work around checkpoint
problems haven't worked as well as expected over the last few years. I
thought that what it did was wait until an amount of time determined by
the target had passed before it did the next write. That's not quite it;
what it actually does is check progress against the target, then sleep
exactly one nap interval if it is ahead of schedule. That is only the
same thing if you have a lot of buffers to write relative to the amount
of time involved. There's some alternative logic if you don't have
bgwriter_lru_maxpages set, but in the normal situation it effectively
means that:

maximum write spread time = bgwriter_delay * checkpoint dirty blocks

No matter how far apart you try to spread the checkpoints. Now,
typically, when people run into these checkpoint spikes in production,
reducing shared_buffers improves that. But I now realize that doing so
will then reduce the average number of dirty blocks participating in the
checkpoint, and therefore potentially pull the spread down at the same
time! Also, if you try to tune bgwriter_delay down to get better
background cleaning, you're also reducing the maximum spread. Between
this issue and the bad behavior when the fsync queue fills, no wonder
this has been so hard to tune out of production systems. At some point,
the reduction in spread defeats further attempts to lower the amount of
data written at checkpoint time.
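
To make that concrete, the write-phase pacing effectively boils down to
something like this per buffer written (a paraphrase of what
CheckpointWriteDelay ends up doing, not the literal code):

/* After each checkpoint buffer write: if we're ahead of the schedule
 * implied by checkpoint_completion_target, sleep exactly one nap and
 * carry on.  Never more than one nap per buffer, so the total spread
 * is capped at bgwriter_delay * (dirty buffers written). */
if (IsCheckpointOnSchedule(progress))
    pg_usleep(BgWriterDelay * 1000L);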

What I do instead is nap until just after the planned schedule, then
execute the sync. What ends up happening then is that there can be a
long pause between the end of the write phase and when syncs start to
happen, which I consider a good thing. Gives the kernel a little more
time to try and get writes moving out to disk. Here's what that looks
like on my development desktop:

2011-02-07 00:46:24 EST: LOG: checkpoint starting: time
2011-02-07 00:48:04 EST: DEBUG: checkpoint sync: estimated segments=10
2011-02-07 00:48:24 EST: DEBUG: checkpoint sync: naps=99
2011-02-07 00:48:36 EST: DEBUG: checkpoint sync: number=1
file=base/16736/16749.1 time=12033.898 msec
2011-02-07 00:48:36 EST: DEBUG: checkpoint sync: number=2
file=base/16736/16749 time=60.799 msec
2011-02-07 00:48:48 EST: DEBUG: checkpoint sync: naps=59
2011-02-07 00:48:48 EST: DEBUG: checkpoint sync: number=3
file=base/16736/16756 time=0.003 msec
2011-02-07 00:49:00 EST: DEBUG: checkpoint sync: naps=60
2011-02-07 00:49:00 EST: DEBUG: checkpoint sync: number=4
file=base/16736/16750 time=0.003 msec
2011-02-07 00:49:12 EST: DEBUG: checkpoint sync: naps=60
2011-02-07 00:49:12 EST: DEBUG: checkpoint sync: number=5
file=base/16736/16737 time=0.004 msec
2011-02-07 00:49:24 EST: DEBUG: checkpoint sync: naps=60
2011-02-07 00:49:24 EST: DEBUG: checkpoint sync: number=6
file=base/16736/16749_fsm time=0.004 msec
2011-02-07 00:49:36 EST: DEBUG: checkpoint sync: naps=60
2011-02-07 00:49:36 EST: DEBUG: checkpoint sync: number=7
file=base/16736/16740 time=0.003 msec
2011-02-07 00:49:48 EST: DEBUG: checkpoint sync: naps=60
2011-02-07 00:49:48 EST: DEBUG: checkpoint sync: number=8
file=base/16736/16749_vm time=0.003 msec
2011-02-07 00:50:00 EST: DEBUG: checkpoint sync: naps=60
2011-02-07 00:50:00 EST: DEBUG: checkpoint sync: number=9
file=base/16736/16752 time=0.003 msec
2011-02-07 00:50:12 EST: DEBUG: checkpoint sync: naps=60
2011-02-07 00:50:12 EST: DEBUG: checkpoint sync: number=10
file=base/16736/16754 time=0.003 msec
2011-02-07 00:50:12 EST: LOG: checkpoint complete: wrote 14335 buffers
(43.7%); 0 transaction log file(s) added, 0 removed, 64 recycled;
write=47.873 s, sync=127.819 s, total=227.990 s; sync files=10,
longest=12.033 s, average=1.209 s

Since this is ext3, the spike during the first sync is brutal anyway,
but it tried very hard to avoid that: it waited 99 * 200ms = 19.8
seconds between writing the last buffer and when it started syncing them
(00:48:04 to 00:48:24). Given the slow write for #1, it was then behind
schedule, so it immediately moved on to #2. But after that, it was able to
insert a moderate nap time between successive syncs--60 naps is 12
seconds, and it keeps that pace for the remainder of the sync. This is
the same sort of thing I'd worked out as optimal on the system this
patch originated from, except it had a lot more dirty relations; that's
why its naptime was the 3 seconds hard-coded into earlier versions of
this patch.
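
The pacing loop behind those naps=NN lines is roughly the following;
again a sketch with an illustrative helper name rather than the literal
patch, but it's the shape of the logic:

/* Before each fsync in mdsync(): while we're still ahead of the sync
 * schedule, keep absorbing backend fsync requests and napping one
 * bgwriter_delay at a time (each nap is counted in the naps=NN lines).
 * Once the schedule catches up, fall through and sync the next file. */
while (CheckpointSyncOnSchedule(finished, goal))    /* illustrative name */
{
    AbsorbFsyncRequests();
    pg_usleep(BgWriterDelay * 1000L);
}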

Results on XFS with mini-server class hardware should be interesting...

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

Attachment: spread-sync-v5.patch (text/x-diff, 10.2 KB)
