Re: Spread checkpoint sync

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spread checkpoint sync
Date: 2011-02-04 19:08:07
Message-ID: 4D4C4E97.7000106@2ndquadrant.com
Lists: pgsql-hackers

As already mentioned in the broader discussion at
http://archives.postgresql.org/message-id/4D4C4610.1030109@2ndquadrant.com ,
I'm seeing no solid performance swing from the checkpoint sorting code
itself. Sometimes it's a bit better, sometimes a bit worse, but never by a
large amount.

Here's what the statistics part derived from the sorted data looks like
on a real checkpoint spike:

2011-02-04 07:02:51 EST: LOG: checkpoint starting: xlog
2011-02-04 07:02:51 EST: DEBUG: BufferSync 10 dirty blocks in
relation.segment_fork 17216.0_2
2011-02-04 07:02:51 EST: DEBUG: BufferSync 159 dirty blocks in
relation.segment_fork 17216.0_1
2011-02-04 07:02:51 EST: DEBUG: BufferSync 10 dirty blocks in
relation.segment_fork 17216.3_0
2011-02-04 07:02:51 EST: DEBUG: BufferSync 548 dirty blocks in
relation.segment_fork 17216.4_0
2011-02-04 07:02:51 EST: DEBUG: BufferSync 808 dirty blocks in
relation.segment_fork 17216.5_0
2011-02-04 07:02:51 EST: DEBUG: BufferSync 799 dirty blocks in
relation.segment_fork 17216.6_0
2011-02-04 07:02:51 EST: DEBUG: BufferSync 807 dirty blocks in
relation.segment_fork 17216.7_0
2011-02-04 07:02:51 EST: DEBUG: BufferSync 716 dirty blocks in
relation.segment_fork 17216.8_0
2011-02-04 07:02:51 EST: DEBUG: BufferSync 3857 buffers to write, 8
total dirty segment file(s) expected to need sync
2011-02-04 07:03:31 EST: DEBUG: checkpoint sync: number=1
file=base/16384/17216.5 time=1324.614 msec
2011-02-04 07:03:31 EST: DEBUG: checkpoint sync: number=2
file=base/16384/17216.4 time=0.002 msec
2011-02-04 07:03:31 EST: DEBUG: checkpoint sync: number=3
file=base/16384/17216_fsm time=0.001 msec
2011-02-04 07:03:47 EST: DEBUG: checkpoint sync: number=4
file=base/16384/17216.10 time=16446.753 msec
2011-02-04 07:03:53 EST: DEBUG: checkpoint sync: number=5
file=base/16384/17216.8 time=5804.252 msec
2011-02-04 07:03:53 EST: DEBUG: checkpoint sync: number=6
file=base/16384/17216.7 time=0.001 msec
2011-02-04 07:03:54 EST: DEBUG: compacted fsync request queue from
32768 entries to 2 entries
2011-02-04 07:03:54 EST: CONTEXT: writing block 1642223 of relation
base/16384/17216
2011-02-04 07:04:00 EST: DEBUG: checkpoint sync: number=7
file=base/16384/17216.11 time=6350.577 msec
2011-02-04 07:04:00 EST: DEBUG: checkpoint sync: number=8
file=base/16384/17216.9 time=0.001 msec
2011-02-04 07:04:00 EST: DEBUG: checkpoint sync: number=9
file=base/16384/17216.6 time=0.001 msec
2011-02-04 07:04:00 EST: DEBUG: checkpoint sync: number=10
file=base/16384/17216.3 time=0.001 msec
2011-02-04 07:04:00 EST: DEBUG: checkpoint sync: number=11
file=base/16384/17216_vm time=0.001 msec
2011-02-04 07:04:00 EST: LOG: checkpoint complete: wrote 3813 buffers
(11.6%); 0 transaction log file(s) added, 0 removed, 64 recycled;
write=39.073 s, sync=29.926 s, total=69.003 s; sync files=11,
longest=16.446 s, average=2.720 s

You can see that it ran out of fsync absorption space in the middle of
the sync phase, which is usually when compaction is needed, but the
recent patch to fix that kicked in and did its thing.
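
To make the "compacted fsync request queue" line above a bit more concrete,
here's a rough standalone sketch of the dedup idea: collapse repeated
requests for the same relation segment down to a single entry. The types
and names here are invented for the illustration; this isn't the actual
patch, just the general shape of it:

#include <stdio.h>
#include <stdbool.h>

typedef struct FsyncRequest
{
    unsigned int relnode;       /* which relation file */
    unsigned int segno;         /* which 1GB segment of it */
} FsyncRequest;

/* Compact in place, keeping the first occurrence of each (relnode, segno)
 * pair; returns the new queue length.  O(n^2), which is fine for a toy. */
static int
compact_queue(FsyncRequest *queue, int len)
{
    int kept = 0;

    for (int i = 0; i < len; i++)
    {
        bool dup = false;

        for (int j = 0; j < kept; j++)
        {
            if (queue[j].relnode == queue[i].relnode &&
                queue[j].segno == queue[i].segno)
            {
                dup = true;
                break;
            }
        }
        if (!dup)
            queue[kept++] = queue[i];
    }
    return kept;
}

int
main(void)
{
    /* Backends tend to re-request fsync of the same few hot segments. */
    FsyncRequest queue[] = {
        {17216, 4}, {17216, 5}, {17216, 4}, {17216, 5},
        {17216, 4}, {17216, 10}, {17216, 5}, {17216, 4}
    };
    int len = sizeof(queue) / sizeof(queue[0]);
    int newlen = compact_queue(queue, len);

    printf("compacted fsync request queue from %d entries to %d entries\n",
           len, newlen);
    return 0;
}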

Couple of observations:

-The total number of buffers I'm computing from the sorted list of
checkpoint writes is not a perfect match to the number reported by the
"checkpoint complete" status line. Sometimes they are the same,
sometimes not. Not sure why yet.

-The "expected to need sync" estimate computed as a by-product of the
checkpoint sorting is not completely accurate either. This particular one
has a fairly large error in it, percentage-wise: 8 segment files expected,
11 actually synced, so it's off by 3. Presumably the extras are absorbed
fsync requests that were already queued up before the checkpoint even
started. So any time estimate I derive from this count is only going to
be approximate. (There's a sketch of how this tally works just after this
list.)

-The order in which the sync phase processes files is unrelated to the
order in which they are written out. Note that 17216.10 here, the
biggest victim (cause?) of the I/O spike, isn't even listed among the
checkpoint writes!
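
Since I mentioned the tally above: here's a standalone sketch of the sort
of bookkeeping behind those BufferSync lines and the "expected to need
sync" count.  Walk the to-be-written buffers with each segment's buffers
adjacent, count dirty blocks per relation segment, and count the distinct
segments.  Anything that was queued for fsync before the checkpoint started
never shows up in this list, which is one way the estimate can come up
short.  Names and layout are invented for the example, not the real code:

#include <stdio.h>

#define BLOCKS_PER_SEGMENT 131072   /* 1GB segment / 8kB blocks */

typedef struct DirtyBuffer
{
    unsigned int relnode;   /* relation file node */
    unsigned int forknum;   /* 0 = main, 1 = fsm, 2 = vm */
    unsigned int blocknum;  /* block number within the relation fork */
} DirtyBuffer;

int
main(void)
{
    /* Pretend this is the checkpoint's write list, already sorted so that
     * each segment's buffers sit next to each other. */
    DirtyBuffer writes[] = {
        {17216, 0, 4 * BLOCKS_PER_SEGMENT + 100},
        {17216, 0, 4 * BLOCKS_PER_SEGMENT + 101},
        {17216, 0, 5 * BLOCKS_PER_SEGMENT + 7},
        {17216, 1, 12},
        {17216, 1, 13}
    };
    int nwrites = sizeof(writes) / sizeof(writes[0]);
    int total_segments = 0;

    for (int i = 0; i < nwrites; i++)
    {
        unsigned int seg = writes[i].blocknum / BLOCKS_PER_SEGMENT;
        int count = 1;

        /* Buffers for the same segment are adjacent, so just run them out. */
        while (i + 1 < nwrites &&
               writes[i + 1].relnode == writes[i].relnode &&
               writes[i + 1].forknum == writes[i].forknum &&
               writes[i + 1].blocknum / BLOCKS_PER_SEGMENT == seg)
        {
            count++;
            i++;
        }
        total_segments++;
        printf("BufferSync %d dirty blocks in relation.segment_fork %u.%u_%u\n",
               count, writes[i].relnode, seg, writes[i].forknum);
    }
    printf("%d buffers to write, %d total dirty segment file(s) expected to need sync\n",
           nwrites, total_segments);
    return 0;
}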

The fuzziness here is a bit disconcerting, and I'll keep digging into why
it happens. But I don't see any reason not to continue forward using the
rough count to derive a nap time from, which I can then feed into the
"useful leftovers" patch that Robert already refactored. I can always
sharpen up that estimate later; next I need to get some solid results I
can share on what the delay time does to the throughput/latency pattern.
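
To spell out what I mean by deriving a nap time: the simplest version just
spreads whatever time budget is left for the sync phase evenly across the
files the estimate says are still waiting.  A trivial sketch, with the
names, the budget, and the 3-second cap all made up for the example rather
than taken from the patch:

#include <stdio.h>

static double
sync_nap_seconds(double remaining_budget_secs, int files_left_to_sync)
{
    double nap;

    if (files_left_to_sync <= 1 || remaining_budget_secs <= 0.0)
        return 0.0;             /* nothing left to spread out over */

    /* One nap fits between each pair of consecutive fsync calls. */
    nap = remaining_budget_secs / (files_left_to_sync - 1);

    /* Cap it so a bad estimate can't stall the checkpoint for too long. */
    if (nap > 3.0)
        nap = 3.0;
    return nap;
}

int
main(void)
{
    /* e.g. 30 seconds of budget left, 8 files expected to need sync */
    printf("nap between fsyncs: %.3f s\n", sync_nap_seconds(30.0, 8));
    return 0;
}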

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
