From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spread checkpoint sync
Date: 2011-01-18 11:19:15
Message-ID: 4D357733.3090104@2ndquadrant.com

Robert Haas wrote:
> Idea #4: For ext3 filesystems that like to dump the entire buffer
> cache instead of only the requested file, write a little daemon that
> runs alongside of (and completely independently of) PostgreSQL. Every
> 30 s, it opens a 1-byte file, changes the byte, fsyncs the file, and
> closes the file, thus dumping the cache and preventing a ridiculous
> growth in the amount of data to be sync'd at checkpoint time.
>
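
That daemon would be close to trivial to write if it turns out anybody
still needs it. Untested sketch; the scratch file location is just made up:

#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
    for (;;)
    {
        /*
         * Rewrite and fsync a 1-byte scratch file; on ext3 with
         * data=ordered that fsync ends up flushing far more than just
         * this file, which is the whole point of the exercise.
         */
        int     fd = open("/var/tmp/fsync-tickle", O_CREAT | O_WRONLY, 0600);

        if (fd >= 0)
        {
            (void) write(fd, "x", 1);
            (void) fsync(fd);
            close(fd);
        }
        sleep(30);
    }
    return 0;
}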

Today's data suggests this problem has been resolved in the latest
kernels. I saw the "giant flush/series of small flushes" pattern quite
easily on the CentOS 5 system I last did heavy pgbench testing on. The
one I'm testing now has kernel 2.6.32 (Ubuntu 10.04), and it doesn't
show it at all.

Here's what a bad checkpoint looks like on this system:

LOG: checkpoint starting: xlog
DEBUG: checkpoint sync: number=1 file=base/24746/36596.8 time=7651.601 msec
DEBUG: checkpoint sync: number=2 file=base/24746/36506 time=0.001 msec
DEBUG: checkpoint sync: number=3 file=base/24746/36596.2 time=1891.695 msec
DEBUG: checkpoint sync: number=4 file=base/24746/36596.4 time=7431.441 msec
DEBUG: checkpoint sync: number=5 file=base/24746/36515 time=0.216 msec
DEBUG: checkpoint sync: number=6 file=base/24746/36596.9 time=4422.892 msec
DEBUG: checkpoint sync: number=7 file=base/24746/36596.12 time=954.242 msec
DEBUG: checkpoint sync: number=8 file=base/24746/36237_fsm time=0.002 msec
DEBUG: checkpoint sync: number=9 file=base/24746/36503 time=0.001 msec
DEBUG: checkpoint sync: number=10 file=base/24746/36584 time=41.401 msec
DEBUG: checkpoint sync: number=11 file=base/24746/36596.7 time=885.921 msec
DEBUG: checkpoint sync: number=12 file=base/24813/30774 time=0.002 msec
DEBUG: checkpoint sync: number=13 file=base/24813/24822 time=0.005 msec
DEBUG: checkpoint sync: number=14 file=base/24746/36801 time=49.801 msec
DEBUG: checkpoint sync: number=15 file=base/24746/36601.2 time=610.996 msec
DEBUG: checkpoint sync: number=16 file=base/24746/36596 time=16154.361 msec
DEBUG: checkpoint sync: number=17 file=base/24746/36503_vm time=0.001 msec
DEBUG: checkpoint sync: number=18 file=base/24746/36508 time=0.000 msec
DEBUG: checkpoint sync: number=19 file=base/24746/36596.10 time=9759.898 msec
DEBUG: checkpoint sync: number=20 file=base/24746/36596.3 time=3392.727 msec
DEBUG: checkpoint sync: number=21 file=base/24746/36237 time=0.150 msec
DEBUG: checkpoint sync: number=22 file=base/24746/36596.11 time=9153.437 msec
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 1057833 of relation base/24746/36596

[>800 more of these]

DEBUG: checkpoint sync: number=23 file=base/24746/36601.1 time=48697.179 msec
DEBUG: could not forward fsync request because request queue is full
DEBUG: checkpoint sync: number=24 file=base/24746/36597 time=0.080 msec
DEBUG: checkpoint sync: number=25 file=base/24746/36237_vm time=0.001 msec
DEBUG: checkpoint sync: number=26 file=base/24813/24822_fsm time=0.001 msec
DEBUG: checkpoint sync: number=27 file=base/24746/36503_fsm time=0.000 msec
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 20619 of relation base/24746/36601
DEBUG: checkpoint sync: number=28 file=base/24746/36506_fsm time=0.000 msec
DEBUG: checkpoint sync: number=29 file=base/24746/36596_vm time=0.040 msec
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 278967 of relation base/24746/36596
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 1582400 of relation base/24746/36596
DEBUG: checkpoint sync: number=30 file=base/24746/36596.6 time=0.002 msec
DEBUG: checkpoint sync: number=31 file=base/24813/11647 time=0.004 msec
DEBUG: checkpoint sync: number=32 file=base/24746/36601 time=201.632 msec
DEBUG: checkpoint sync: number=33 file=base/24746/36801_fsm time=0.001 msec
DEBUG: checkpoint sync: number=34 file=base/24746/36596.5 time=0.001 msec
DEBUG: checkpoint sync: number=35 file=base/24746/36599 time=0.000 msec
DEBUG: checkpoint sync: number=36 file=base/24746/36587 time=0.005 msec
DEBUG: checkpoint sync: number=37 file=base/24746/36596_fsm time=0.001 msec
DEBUG: checkpoint sync: number=38 file=base/24746/36596.1 time=0.001 msec
DEBUG: checkpoint sync: number=39 file=base/24746/36801_vm time=0.001 msec
LOG: checkpoint complete: wrote 9515 buffers (29.0%); 0 transaction log
file(s) added, 0 removed, 64 recycled; write=32.409 s, sync=111.615 s,
total=144.052 s; sync files=39, longest=48.697 s, average=2.853 s
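
Doing the math on that: sync is 111.615 seconds out of 144.052 total,
around 77% of the whole checkpoint, and the single worst file at 48.697
seconds accounts for over 40% of the sync phase all by itself.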

Here the file that's been brutally delayed by backend contention is
#23, after quite long delays on several of the earlier ones already.
That's a pattern I never saw with earlier kernels running ext3.
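
To spell out what's behind those "request queue is full" lines: when a
backend can't hand its fsync request off to the background writer, it has
to do the fsync itself. A paraphrase of that fallback--not the actual
backend code, names simplified--looks like this:

#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Stand-in for the shared fsync request queue that the background writer
 * drains; here it just reports "full" so the fallback path runs.
 */
static bool
forward_fsync_request(int fd)
{
    (void) fd;
    return false;
}

/*
 * Roughly what a backend does after dirtying a segment: hand the fsync
 * off to the bgwriter if the queue has room, otherwise sync it right
 * here.  During a big checkpoint the queue fills up, backends fall into
 * this branch, and they end up competing with the checkpoint fsyncs for
 * the same saturated disks--hence the delays like file #23 above.
 */
static void
register_dirty_segment(int fd)
{
    if (!forward_fsync_request(fd))
    {
        fprintf(stderr,
                "could not forward fsync request because request queue is full\n");
        (void) fsync(fd);
    }
}

int
main(void)
{
    int     fd = open("/tmp/segment-demo", O_CREAT | O_WRONLY, 0600);

    if (fd >= 0)
    {
        register_dirty_segment(fd);
        close(fd);
    }
    return 0;
}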

This is good news in that it makes it more likely a spread sync approach
that works on XFS will also work on these newer kernels with ext4. If that
pans out, the only group we wouldn't be able to help is the ext3 + old
kernel crowd.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
