
From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Spread checkpoint sync
Date: 2011-02-11 03:30:44
Message-ID: 4D54AD64.3010306@2ndquadrant.com
Lists: pgsql-hackers

Looks like it's time to close the book on this one for 9.1
development...the unfortunate results are at
http://www.2ndquadrant.us/pgbench-results/index.htm Test set #12 is the
one with spread sync that I was hoping would turn out better than #9, the
reference I was trying to improve on. TPS is about 5% slower on the
scale=500 tests and 15% slower on the scale=1000 tests with sync spread out.
Even worse, maximum latency went up a lot.

I am convinced of a few things now:

1) Most of the benefit we were seeing from the original patch I
submitted came simply from doing much better at absorbing fsync requests
from backends while the checkpoint sync was running. The already
committed fsync compaction patch effectively removes that problem, to
the extent it's possible to do so, which makes the remaining pieces here
much less useful in its wake. (There's a sketch of the compaction idea
just after this list.)

2) I need to start over testing here with something that isn't 100%
write all of the time the way pgbench is. It's really hard to isolate
out latency improvements when the test program guarantees all associated
write caches will be completely filled at every moment. It also means I
can't see any benefit from changes that mainly help readers, which makes
pgbench quite unrealistic relative to real-world workloads.

3) The existing write spreading code in the background writer needs to
be overhauled, too, before spreading the syncs around is going to give
the benefits I was hoping for.
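
To make the first of those points a bit more concrete, here is the
sketch I mentioned of what fsync compaction does: collapse duplicate
requests so each file segment only needs to be synced once. This is a
standalone illustration with made-up structure and function names, not
the committed patch (which works against the real shared request queue);
it just shows the shape of the technique:

/* Minimal sketch of fsync-request compaction; not PostgreSQL source code.
 * Many backends can queue requests against the same file while the
 * checkpoint sync phase is running; compacting the queue so each file
 * segment appears only once keeps it from overflowing. */
#include <stdio.h>
#include <stdbool.h>

typedef struct FsyncRequest
{
    unsigned int rel_node;   /* which relation file */
    unsigned int segment;    /* which 1GB segment of it */
} FsyncRequest;

/* Remove duplicate requests in place; returns the new queue length. */
static int
compact_fsync_queue(FsyncRequest *queue, int len)
{
    int kept = 0;

    for (int i = 0; i < len; i++)
    {
        bool duplicate = false;

        for (int j = 0; j < kept; j++)
        {
            if (queue[j].rel_node == queue[i].rel_node &&
                queue[j].segment == queue[i].segment)
            {
                duplicate = true;
                break;
            }
        }
        if (!duplicate)
            queue[kept++] = queue[i];
    }
    return kept;
}

int
main(void)
{
    FsyncRequest queue[] = {
        {16761, 0}, {16761, 0}, {16766, 0}, {16761, 5}, {16761, 0}
    };
    int len = sizeof(queue) / sizeof(queue[0]);

    len = compact_fsync_queue(queue, len);
    printf("%d requests left after compaction\n", len);   /* prints 3 */
    return 0;
}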

Given all that, I'm going to take my feedback and give the test server a
much deserved break. I'm happy that the fsync compaction patch has made
9.1 much more tolerant of write-heavy loads than earlier versions, so
it's not like no progress was made in this release.

For anyone who wants more details here...the news on this spread sync
implementation is not all bad. If you compare this result from HEAD,
with scale=1000 and clients=256:

http://www.2ndquadrant.us/pgbench-results/611/index.html

Against its identically configured result with spread sync:

http://www.2ndquadrant.us/pgbench-results/708/index.html

There are actually significantly fewer transactions in the >2000 ms
latency range. That shows up as a reduction in the 90th percentile
latency figures I compute, and you can see it in the graph if you look
at how much denser the points are in the 2000 - 4000 ms area on #611.
But that's a pretty weak improvement.

But the most disappointing part here, relative to what I was hoping
for, is what happens with bigger buffer caches. The main idea driving
this approach was that it would enable larger values of shared_buffers
without the checkpoint spikes being as bad. Test set #13 tries that
out, by increasing shared_buffers from 256MB to 4GB, along with a big
enough increase in checkpoint_segments to make most checkpoints time
based. Not only did TPS at the smaller scale drop in half, all kinds
of bad things happened to latency. Here's a sample of the sort of
dysfunctional checkpoints that came out of that:

2011-02-10 02:41:17 EST: LOG: checkpoint starting: xlog
2011-02-10 02:53:15 EST: DEBUG: checkpoint sync: estimated segments=22
2011-02-10 02:53:15 EST: DEBUG: checkpoint sync: number=1 file=base/16384/16768 time=150.008 msec
2011-02-10 02:53:15 EST: DEBUG: checkpoint sync: number=2 file=base/16384/16749 time=0.002 msec
2011-02-10 02:53:15 EST: DEBUG: checkpoint sync: number=3 file=base/16384/16749_fsm time=0.001 msec
2011-02-10 02:53:23 EST: DEBUG: checkpoint sync: number=4 file=base/16384/16761 time=8014.102 msec
2011-02-10 02:53:23 EST: DEBUG: checkpoint sync: number=5 file=base/16384/16752_vm time=0.002 msec
2011-02-10 02:53:35 EST: DEBUG: checkpoint sync: number=6 file=base/16384/16761.5 time=11739.038 msec
2011-02-10 02:53:37 EST: DEBUG: checkpoint sync: number=7 file=base/16384/16761.6 time=2205.721 msec
2011-02-10 02:53:45 EST: DEBUG: checkpoint sync: number=8 file=base/16384/16761.2 time=8273.849 msec
2011-02-10 02:54:06 EST: DEBUG: checkpoint sync: number=9 file=base/16384/16766 time=20874.167 msec
2011-02-10 02:54:06 EST: DEBUG: checkpoint sync: number=10 file=base/16384/16762 time=0.002 msec
2011-02-10 02:54:08 EST: DEBUG: checkpoint sync: number=11 file=base/16384/16761.3 time=2440.441 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=12 file=base/16384/16766.1 time=635.839 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=13 file=base/16384/16752_fsm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=14 file=base/16384/16764 time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=15 file=base/16384/16768_fsm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=16 file=base/16384/16761_vm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=17 file=base/16384/16761.4 time=150.702 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=18 file=base/16384/16752 time=0.002 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=19 file=base/16384/16761_fsm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=20 file=base/16384/16749_vm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=21 file=base/16384/16385 time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=22 file=base/16384/16761.1 time=175.575 msec
2011-02-10 02:54:10 EST: LOG: checkpoint complete: wrote 242614 buffers
(46.3%); 0 transaction log file(s) added, 0 removed, 34 recycled;
write=716.637 s, sync=54.659 s, total=772.976 s; sync files=22,
longest=20.874 s, average=2.484 s
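
As an aside, there's nothing magic about where those per-file timing
numbers come from; the technique is just to stamp the clock on both
sides of each fsync call and report the difference. Here's a standalone
sketch of that measurement idea, using a made-up demo file rather than
anything from the patch:

/* Sketch of timing an individual fsync the way the debug lines above do:
 * record a timestamp before and after the call and report the difference
 * in milliseconds.  Standalone illustration only, not the patch itself. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

static double
timed_fsync(int fd)
{
    struct timeval start, stop;

    gettimeofday(&start, NULL);
    if (fsync(fd) != 0)
        perror("fsync");
    gettimeofday(&stop, NULL);

    return (stop.tv_sec - start.tv_sec) * 1000.0 +
           (stop.tv_usec - start.tv_usec) / 1000.0;
}

int
main(void)
{
    /* Hypothetical demo file, used purely for illustration */
    int fd = open("/tmp/sync-demo.dat", O_RDWR | O_CREAT, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    if (write(fd, "x", 1) != 1)
        perror("write");
    printf("sync time=%.3f msec\n", timed_fsync(fd));
    close(fd);
    return 0;
}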

The checkpoint complete line shows a 12 minute write phase, even though
checkpoints should be happening every 5 minutes here. With that bad a
write phase overrun, spread sync had no room left to work in, so there
was no net improvement at all. What is happening here is similar to the
behavior I described seeing on my client system, but didn't have an
example of to share until now. During the write phase, watching
"Dirty:" in /proc/meminfo showed the value peaking at over 1GB while
writes were happening, and eventually the background writer process
wasn't getting any serious CPU time compared to the backends; this is
what it looked like via ps:

%CPU %MEM TIME+ COMMAND
4 0 01:51.28 /home/gsmith/pgwork/inst/spread-sync/bin/pgbench -f /home/gsmith/pgbench-tools
2 8.1 00:39.71 postgres: gsmith pgbench ::1(43871) UPDATE
2 8 00:39.28 postgres: gsmith pgbench ::1(43875) UPDATE
2 8.1 00:39.92 postgres: gsmith pgbench ::1(43865) UPDATE
2 8.1 00:39.54 postgres: gsmith pgbench ::1(43868) UPDATE
2 8 00:39.36 postgres: gsmith pgbench ::1(43870) INSERT
2 8.1 00:39.47 postgres: gsmith pgbench ::1(43877) UPDATE
1 8 00:39.39 postgres: gsmith pgbench ::1(43864) COMMIT
1 8.1 00:39.78 postgres: gsmith pgbench ::1(43866) UPDATE
1 8 00:38.99 postgres: gsmith pgbench ::1(43867) UPDATE
1 8.1 00:39.55 postgres: gsmith pgbench ::1(43872) UPDATE
1 8.1 00:39.90 postgres: gsmith pgbench ::1(43873) UPDATE
1 8.1 00:39.64 postgres: gsmith pgbench ::1(43876) UPDATE
1 8.1 00:39.93 postgres: gsmith pgbench ::1(43878) UPDATE
1 8.1 00:39.83 postgres: gsmith pgbench ::1(43863) UPDATE
1 8 00:39.47 postgres: gsmith pgbench ::1(43869) UPDATE
1 8.1 00:40.11 postgres: gsmith pgbench ::1(43874) UPDATE
1 0 00:11.91 [flush-9:1]
0 0 27:43.75 [xfsdatad/6]
0 9.4 00:02.21 postgres: writer process
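
If anyone wants to watch the same thing on their own system, the Dirty
figure is easy to poll. Here's a small Linux-only sketch, not part of
any patch, that just prints it once per second until interrupted:

/* Minimal Linux-only sketch: poll the "Dirty:" line from /proc/meminfo
 * once a second, the same figure referenced above.  Not part of any patch. */
#include <stdio.h>
#include <unistd.h>

static long
dirty_kb(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[128];
    long kb = -1;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL)
    {
        if (sscanf(line, "Dirty: %ld kB", &kb) == 1)
            break;
    }
    fclose(f);
    return kb;
}

int
main(void)
{
    /* Runs until interrupted with Ctrl-C */
    for (;;)
    {
        printf("Dirty: %ld kB\n", dirty_kb());
        fflush(stdout);
        sleep(1);
    }
    return 0;
}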

I want to make this problem go away, but as you can see spreading the
sync calls around isn't enough. I think the main write loop needs to
get spread out more, too, so that the background writer is trying to
work at a more reasonable pace. I am pleased I've been able to
reproduce this painful behavior at home using test data, because that
much improves my odds of being able to isolate its cause and test
solutions. But it's a tricky problem, and I'm certainly not going to
fix it in the next week.
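
To spell out what I mean by spreading the write loop out: the write
phase is supposed to pace itself by comparing the fraction of buffers it
has written so far against how much of the checkpoint interval has
elapsed, napping whenever it's ahead of schedule. Here's a stripped-down
sketch of that style of pacing, with made-up constants rather than the
server's real code, just to show the mechanism whose behavior needs to
improve under this sort of load:

/* Stripped-down sketch of progress-based write pacing: only issue the
 * next write when the fraction of buffers written has fallen behind the
 * fraction of the checkpoint interval that has elapsed, otherwise nap.
 * Names and numbers are illustrative, not the server's actual code. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define BUFFERS_TO_WRITE   1000
#define CHECKPOINT_SECONDS 30      /* stand-in for the checkpoint interval */

int
main(void)
{
    time_t start = time(NULL);
    int    written = 0;

    while (written < BUFFERS_TO_WRITE)
    {
        double write_progress = (double) written / BUFFERS_TO_WRITE;
        double time_progress  = (double) (time(NULL) - start) / CHECKPOINT_SECONDS;

        if (write_progress > time_progress)
        {
            /* Ahead of schedule: nap instead of writing, so the I/O is
             * spread across the whole interval rather than front-loaded. */
            usleep(100000);        /* 100 ms */
            continue;
        }
        /* Behind (or on) schedule: "write" the next buffer. */
        written++;
    }
    printf("wrote %d buffers in %ld seconds\n",
           written, (long) (time(NULL) - start));
    return 0;
}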

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
