Spread checkpoint sync

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Spread checkpoint sync
Date: 2010-11-14 23:48:24
Message-ID: 4CE07548.4030709@2ndquadrant.com
Lists: pgsql-hackers

Final patch in this series for today spreads out the individual
checkpoint fsync calls over time, and was written by myself and Simon
Riggs. The patch is based on a tree that's already had the two
patches I sent over earlier today applied, rather than HEAD, as both are
useful for measuring how well this one works. You can grab a tree with
all three from my Github repo, via the "checkpoint" branch:
https://github.com/greg2ndQuadrant/postgres/tree/checkpoint

This is a work in progress. While I've seen this reduce checkpoint
spike latency significantly on a large system, I don't have any
referenceable performance numbers I can share yet. There are also a
couple of problems I know about, and I'm sure others I haven't thought
of yet. The first known issue is that it delays manual or other
"forced" checkpoints, which is not necessarily wrong if you really are
serious about spreading syncs out, but it is certainly surprising when
you run into it. I notice this most when running createdb on a busy
system. There's no real reason for this to happen; the code passes down
the fact that it's a forced checkpoint, it just doesn't act on that yet.
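
Just to illustrate the shape of the missing piece, a minimal sketch of
acting on that flag (the function name here is invented for
illustration; CHECKPOINT_IMMEDIATE and pg_usleep are the existing
backend symbols):

static void
CheckpointSyncDelay(int flags)
{
    /*
     * Hypothetical sketch only: skip the inter-sync pause when the
     * checkpoint was forced (manual CHECKPOINT, createdb, shutdown and
     * so on).  The 3 second figure matches the current hard-coded
     * delay described below.
     */
    if (flags & CHECKPOINT_IMMEDIATE)
        return;                     /* forced: don't dawdle */

    pg_usleep(3000000L);            /* otherwise, the usual pause */
}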

The second issue is that the delay between sync calls is currently
hard-coded, at 3 seconds. I believe the right path here is to consider
the current checkpoint_completion_target to still be valid, then work
back from there. That raises the question of what fraction of that
target the writes should now be compressed into, to leave some time
over to spread out the sync calls. If we're willing to say "writes
finish in the first 1/2 of the target, syncs execute in the second
1/2", then I could implement that here. Maybe that ratio needs to be
another tunable. Still thinking about that part, and it's certainly
open to community debate. The thing that complicates the design is
that an individual sync may take a considerable period of time to
execute. That's much more likely to happen for a sync than for an
individual write (which is all the current spread checkpoint code
spaces out), because writes usually just land in the OS cache. In the
spread sync case, it's easy for one slow sync to force the remaining
ones to fire in quick succession, to make up for lost time.
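
As a rough sketch of that idea, and nothing more: assuming a writes/
syncs split fraction and a count of pending sync requests (both
parameter names below are invented; CheckPointTimeout and
CheckPointCompletionTarget are the existing GUC variables), the
per-sync pause might be derived like this:

/*
 * Hypothetical sketch of replacing the hard-coded 3 second pause.
 * sync_fraction would be e.g. 0.5 for "syncs execute in second 1/2".
 */
static int
SyncPauseMsec(double sync_fraction, int num_pending_syncs)
{
    double  budget_secs;

    if (num_pending_syncs <= 0)
        return 0;

    /* total time this checkpoint is allowed to spread itself over */
    budget_secs = CheckPointTimeout * CheckPointCompletionTarget;

    /* reserve the tail of that budget for syncs, split it evenly */
    return (int) (budget_secs * sync_fraction * 1000.0 / num_pending_syncs);
}

A slow individual sync eats into that budget, which is where the
quick-succession behavior comes from: there's less pause time left to
divide among the syncs that remain.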

There's some history behind this design that impacts review. Circa 8.3
development in 2007, I had experimented with putting some delay between
each of the fsync calls that the background writer executes during a
checkpoint. It didn't help smooth things out at all at the time. It
turns out that's mainly because all my tests were on Linux using ext3.
On that filesystem, fsync is not very granular. It's quite likely it
will push out data you haven't asked to sync yet, which means one giant
sync is almost impossible to avoid no matter how you space the fsync
calls. If you try and review this on ext3, I expect you'll find a big
spike early in each checkpoint (where it flushes just about everything
out) and then quick response for the later files involved.

The system this patch was originally written to help was running XFS.
There, I've confirmed that this problem doesn't exist: individual syncs
only seem to push out the data related to one file. The same should be
true
on ext4, but I haven't tested that myself. Not sure how granular the
fsync calls are on Solaris, FreeBSD, Darwin, etc. yet. Note that it's
still possible to get hung on one sync call for a while, even on XFS.
The worst case seems to be if you've created a new 1GB database table
chunk and fully populated it since the last checkpoint, on a system
that's just cached the whole thing so far.

One change that turned out to be necessary rather than optional, in
order to get good performance from the system being tuned, was to make
regular background writer activity, including fsync absorb checks,
happen during these sync pauses. The existing code ran the checkpoint
sync work in a pretty tight loop, which as I alluded to in an earlier
patch today can lead to the backends competing with the background
writer to get their sync calls executed. This squashes that problem,
if the background writer is set up properly.
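
The shape of that change, purely as an illustration (AbsorbFsyncRequests,
BgBufferSync, pg_usleep and BgWriterDelay are the existing backend
routines and variables; the wrapper function and loop below are not the
patch's actual code):

/*
 * Hypothetical sketch of the pause taken between individual checkpoint
 * fsync calls.  Rather than one long sleep in a tight loop, keep doing
 * normal background writer work while waiting, so backend fsync
 * requests get absorbed instead of piling up.
 */
static void
SyncPauseWithAbsorb(int pause_ms)
{
    int     elapsed_ms = 0;

    while (elapsed_ms < pause_ms)
    {
        AbsorbFsyncRequests();      /* pull in backend sync requests */
        BgBufferSync();             /* keep regular LRU cleaning going */

        pg_usleep(BgWriterDelay * 1000L);
        elapsed_ms += BgWriterDelay;
    }
}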

What does properly mean? Well, it can't do that cleanup if the
background writer is sleeping. This whole area was refactored. The
current sync absorb code uses the constant WRITES_PER_ABSORB to make
decisions. This new version replaces that hard-coded value with
something that scales to the system size. It now puts off doing that work
until the number of pending absorb requests has reached 10% of the
number possible to store (BgWriterShmem->max_requests, which is set to
the size of shared_buffers in 8K pages, AKA NBuffers). This may
actually postpone this work for too long on systems with large
shared_buffers settings; that's one area I'm still investigating.
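
The test itself is trivial; roughly this (num_requests and max_requests
are the existing BgWriterShmem fields, while the helper name and the
hard-coded 10% are just how the current version happens to be written):

/*
 * Hypothetical sketch: only bother absorbing once the pending request
 * queue is at least 10% full, rather than after a fixed
 * WRITES_PER_ABSORB count of writes.
 */
static bool
AbsorbThresholdReached(void)
{
    return BgWriterShmem->num_requests >= BgWriterShmem->max_requests / 10;
}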

As for concerns that this 10% setting won't do enough work, which is
something I do see, you can now always increase how often absorbing
happens by decreasing bgwriter_delay, which gives other benefits too.
For example, if you run the fsync-stress-v2.sh script I included with
the last patch I sent, you'll discover the spread sync version of the
server leaves just as many unabsorbed writes behind as the old code
did. Those are happening because of periods where the background
writer is sleeping. They drop as you decrease the delay; here's a
table showing some values I tested, with all three patches installed:

bgwriter_delay    buffers_backend_sync
200 ms            90
50 ms             28
25 ms             3

There's a bunch of performance-related review work that needs to be done
here, in addition to the usual code review for the patch. My hope is
that I can get enough of that done, validating on public hardware that
this does what it's supposed to, for a later version of this patch to be
considered for the next CommitFest. It's still a little more raw than
I'd like, but the idea has been tested enough here that I believe it's
fundamentally sound and valuable.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us

Attachment Content-Type Size
sync-spread-v2.patch text/x-patch 7.4 KB
