Re: Spread checkpoint sync

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spread checkpoint sync
Date: 2010-11-30 20:29:57
Message-ID: 4CF55EC5.5000108@2ndquadrant.com
Lists: pgsql-hackers

Ron Mayer wrote:
> Might smoother checkpoints be better solved by talking
> to the OS vendors & virtual-memory-tuning-knob-authors
> to work with them on exposing the ideal knobs; rather than
> saying that our only tool is a hammer (fsync) so the problem
> must be handled as a nail.
>

Maybe, but it's hard to argue that the current implementation--just
doing all of the sync calls as fast as possible, one after the other--isn't
going to produce worst-case behavior in a lot of situations. Given that
it's not a huge amount of code to do better, I'd rather do some work in
that direction than presume the kernel authors will eventually make this
go away. Spreading the writes out as part of the checkpoint rework
in 8.3 worked better than any kernel changes I've tested since then, and
I'm not really optimistic about this getting resolved at the system level.
So long as the database changes aren't antagonistic toward kernel
improvements, I'd prefer to have some options here that become effective
as soon as the database code is done.
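
The basic direction is easy to sketch. What follows is a toy,
self-contained illustration rather than the attached patch: fsync the
checkpoint's files one at a time, sleeping in between so the flushes land
spread across the checkpoint window instead of all at once. The file
list, the 200 ms pause, and the function name are all made up for the
example:

/*
 * Toy illustration of spread sync, not the attached patch: fsync each
 * file in turn, pausing between calls so the flushes are spread across
 * the checkpoint window.  The 200 ms pause is an arbitrary placeholder.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void
spread_sync(const int *fds, int nfiles, unsigned int pause_usec)
{
	for (int i = 0; i < nfiles; i++)
	{
		if (fsync(fds[i]) != 0)
			perror("fsync");

		/* sleep between files, but not after the last one */
		if (i < nfiles - 1)
			usleep(pause_usec);
	}
}

int
main(void)
{
	int		fd = open("scratch.dat", O_RDWR | O_CREAT, 0600);

	if (fd < 0)
	{
		perror("open");
		return 1;
	}
	spread_sync(&fd, 1, 200 * 1000);	/* 200 ms between files */
	close(fd);
	return 0;
}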

I've attached an updated version of the initial sync spreading patch
here, one that applies cleanly on top of HEAD and over top of the sync
instrumentation patch too. The conflict that made that hard before is
gone now.

Having the pg_stat_bgwriter.buffers_backend_fsync patch available all
the time has made me reconsider how important one potential bit of
refactoring here would be. On my test system I managed to catch one of
the situations where heavily updated, really popular relations were
competing with the checkpoint (I can happily share the full logs); the
instrumentation patch was applied, but not the spread sync one:

LOG: checkpoint starting: xlog
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 7747 of relation base/16424/16442
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 42688 of relation base/16424/16437
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 9723 of relation base/16424/16442
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 58117 of relation base/16424/16437
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 165128 of relation base/16424/16437
[330 of these total, all referring to the same two relations]

DEBUG: checkpoint sync: number=1 file=base/16424/16448_fsm time=10132.830000 msec
DEBUG: checkpoint sync: number=2 file=base/16424/11645 time=0.001000 msec
DEBUG: checkpoint sync: number=3 file=base/16424/16437 time=7.796000 msec
DEBUG: checkpoint sync: number=4 file=base/16424/16448 time=4.679000 msec
DEBUG: checkpoint sync: number=5 file=base/16424/11607 time=0.001000 msec
DEBUG: checkpoint sync: number=6 file=base/16424/16437.1 time=3.101000 msec
DEBUG: checkpoint sync: number=7 file=base/16424/16442 time=4.172000 msec
DEBUG: checkpoint sync: number=8 file=base/16424/16428_vm time=0.001000 msec
DEBUG: checkpoint sync: number=9 file=base/16424/16437_fsm time=0.001000 msec
DEBUG: checkpoint sync: number=10 file=base/16424/16428 time=0.001000 msec
DEBUG: checkpoint sync: number=11 file=base/16424/16425 time=0.000000 msec
DEBUG: checkpoint sync: number=12 file=base/16424/16437_vm time=0.001000 msec
DEBUG: checkpoint sync: number=13 file=base/16424/16425_vm time=0.001000 msec
LOG: checkpoint complete: wrote 3032 buffers (74.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=1.742 s, sync=10.153 s, total=37.654 s; sync files=13, longest=10.132 s, average=0.779 s

Note here how the checkpoint was hung on trying to get 16448_fsm written
out, but the backends were issuing constant competing fsync calls to
these other relations. This is very similar to the production case this
patch was written to address, which I hadn't been able to share a good
example of yet. That's essentially what it looks like, except with the
contention going on for minutes instead of seconds.
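
For anyone who hasn't traced where those DEBUG lines come from: when a
backend can't hand its fsync request off to the background writer
because the request queue is full, it has to do the fsync itself, which
is exactly what competes with the checkpoint here. A stripped-down
sketch of that shape--the stub and function names below are mine, not
the real md.c code:

/*
 * Rough shape of the fallback behind the "could not forward fsync
 * request" messages: when the request queue to the bgwriter is full,
 * the backend syncs the file itself.  forward_fsync_request() is a
 * stand-in stub, hardwired here to report a full queue.
 */
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static bool
forward_fsync_request(const char *path)
{
	(void) path;
	return false;				/* pretend the queue is always full */
}

static void
register_dirty_file(int fd, const char *path)
{
	if (!forward_fsync_request(path))
	{
		fprintf(stderr,
				"could not forward fsync request, syncing %s here\n",
				path);
		if (fsync(fd) != 0)		/* the backend eats the sync cost */
			perror("fsync");
	}
}

int
main(void)
{
	int		fd = open("scratch.dat", O_RDWR | O_CREAT, 0600);

	if (fd < 0)
	{
		perror("open");
		return 1;
	}
	register_dirty_file(fd, "scratch.dat");
	close(fd);
	return 0;
}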

One of the ideas Simon and I had been considering at one point was
adding some better de-duplication logic to the fsync absorb code. The
pattern here reminds me that it might be helpful independently of the
other improvements.
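
Something like the following--a toy version, with made-up names and a
naive O(n^2) scan where the real thing would want a hash table--shows
why that helps: the 330 queue-full requests above, all against the same
two relations, would collapse to a handful of entries:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical request record: which relation segment needs an fsync. */
typedef struct FsyncRequest
{
	unsigned	rel;			/* relation OID */
	unsigned	seg;			/* segment number */
} FsyncRequest;

/*
 * Compact the queue in place, keeping only the first request for each
 * (rel, seg) pair; returns the new queue length.
 */
static int
compact_fsync_queue(FsyncRequest *q, int n)
{
	int		out = 0;

	for (int i = 0; i < n; i++)
	{
		bool	dup = false;

		for (int j = 0; j < out; j++)
		{
			if (q[j].rel == q[i].rel && q[j].seg == q[i].seg)
			{
				dup = true;
				break;
			}
		}
		if (!dup)
			q[out++] = q[i];
	}
	return out;
}

int
main(void)
{
	FsyncRequest q[] = {
		{16437, 0}, {16442, 0}, {16437, 0}, {16437, 1}, {16442, 0}
	};
	int		n = compact_fsync_queue(q, 5);

	printf("%d unique requests remain\n", n);	/* prints 3 */
	return 0;
}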

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

Attachment Content-Type Size
sync-spread-v3.patch text/x-patch 7.4 KB
