Re: Load distributed checkpoint

From: "Takayuki Tsunakawa" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>
To: "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Load distributed checkpoint
Date: 2006-12-20 11:10:56
Message-ID: 037901c72427$859864f0$19527c0a@OPERAO
Lists: pgsql-hackers pgsql-patches

Hello, Itagaki-san, all

I have to report a sad result: your patch didn't work. Let's
consider the solution together. What you are addressing -- smoothing
response times -- is very important for system designers in the real
world.

Recall that the unpatched PostgreSQL showed the following tps figures
in case (1) (i.e. with the default bgwriter_* and checkpoint_*
settings):

235 80 226 77 240

The patched PostgreSQL showed the following tps figures:

230 228 77 209 66

[disk usage]
The same tendency can be seen as with the unpatched PostgreSQL, that
is: when the tps is low, the %util of the disk holding the data files
is high and the %util of the WAL disk is low. Why is transaction
logging disturbed by the cleaning and/or syncing activity? While the
bgwriter is fsync()ing, it does not hold any locks on data structures
that the transactions want to access. And even though the two devices
share the same SCSI controller and bus, they are separate disks, and
the bandwidth does not appear to be exhausted, since Ultra320 is said
to deliver about 256MB/s in practice.
(Recall that WAL is on sdd and the data files are on sde.)

Device:  rrqm/s   wrqm/s    r/s     w/s    rsec/s     wsec/s    rkB/s     wkB/s avgrq-sz avgqu-sz    await  svctm  %util
sdd        0.00   810.78   0.00  102.20      0.00    7306.99     0.00   3653.49    71.50     1.12    10.95   7.32  74.77
sde        0.00    25.35   0.00    6.19      0.00     252.30     0.00    126.15    40.77     0.50    81.32   5.94   3.67
sdd        0.00   884.20   0.00  126.00      0.00    8080.00     0.00   4040.00    64.13     1.26    10.00   7.11  89.64
sde        0.00    21.40   0.00    5.00      0.00     211.20     0.00    105.60    42.24     0.31    62.56   6.52   3.26
sdd        0.00   924.80   0.00  116.20      0.00    8326.40     0.00   4163.20    71.66     1.23    10.59   7.37  85.64
sde        0.00    27.60   0.00   26.60      0.00     433.60     0.00    216.80    16.30     4.24   159.29   2.44   6.50
sdd        0.00   721.20   0.00  102.40      0.00    6588.80     0.00   3294.40    64.34     0.99     9.71   7.07  72.40
sde        0.00  1446.80   0.00  101.60      0.00   20289.60     0.00  10144.80   199.70  1192.40   572.45   2.29  23.30
sdd        0.00     0.00   0.00    0.20      0.00       1.60     0.00      0.80     8.00     0.11   539.00 539.00  10.80
sde        0.00     0.00   0.00  452.10      0.00       0.00     0.00      0.00     0.00  3829.57  3715.83   2.22 100.22
sdd        0.00     0.00   0.00    0.00      0.00       0.00     0.00      0.00     0.00     0.00     0.00   0.00   0.00
sde        0.00     0.00   0.00  349.80      0.00       0.00     0.00      0.00     0.00  1745.52  8515.74   2.86 100.02
sdd        0.00   442.40   0.00   51.00      0.00    3948.80     0.00   1974.40    77.43     0.60    11.73   7.54  38.46
sde        0.00     2.80   0.00  184.00      0.00      25.60     0.00     12.80     0.14   277.52 12629.41   3.19  58.74
sdd        0.00   898.00   0.00  124.80      0.00    8182.40     0.00   4091.20    65.56     1.30    10.40   7.24  90.30
sde        0.00    19.20   0.00    3.80      0.00     184.00     0.00     92.00    48.42     0.24    62.11  14.11   5.36
sdd        0.00   842.28   0.00  109.02      0.00    7612.02     0.00   3806.01    69.82     1.33    12.26   8.35  91.02
sde        0.00    45.49   0.00   46.89      0.00     739.08     0.00    369.54    15.76     9.04   192.73   3.38  15.85
sdd        0.00  1198.41   0.00   71.51      0.00   10505.18     0.00   5252.59   146.90   128.19    99.76  13.48  96.43
sde        0.00  1357.77   0.00  199.80      0.00   19263.75     0.00   9631.87    96.41  2251.09  1179.42   2.39  47.81
sdd        0.00     0.00   0.00    7.20      0.00       0.00     0.00      0.00     0.00   203.87  5671.83 138.92 100.02
sde        0.00     0.00   0.00  409.60      0.00       0.00     0.00      0.00     0.00  3171.04  4779.83   2.44 100.02
sdd        0.00     0.00   0.00   17.80      0.00       0.00     0.00      0.00     0.00   137.87 10240.90  56.19 100.02
sde        0.00     0.00   0.00  240.60      0.00       0.00     0.00      0.00     0.00  1573.85  9815.29   4.16 100.02
sdd        0.00   109.80   0.00   35.40      0.00    1012.80     0.00    506.40    28.61    42.14  7974.47  27.86  98.64
sde        0.00     2.80   0.00  198.80      0.00      30.40     0.00     15.20     0.15   428.49 14474.39   4.30  85.56
sdd        0.00   466.20   0.00   62.80      0.00    4230.40     0.00   2115.20    67.36     0.59     9.49   6.79  42.62
sde        0.00     5.20   0.00    0.80      0.00      48.00     0.00     24.00    60.00     0.01    16.25  11.25   0.90
sdd        0.00     0.00   0.00    0.20      0.00       1.60     0.00      0.80     8.00     0.01    35.00  35.00   0.70
sde        0.00     0.00   0.00    0.00      0.00       0.00     0.00      0.00     0.00     0.00     0.00   0.00   0.00
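
Just to back the bandwidth claim with numbers: iostat reports rsec/s
and wsec/s in 512-byte sectors, so even the peak write rates above are
only a few MB/s per disk. A trivial back-of-envelope check (the peak
values and the 256MB/s practical Ultra320 figure are taken from above;
this is only an illustration):

#include <stdio.h>

/* Convert the peak wsec/s samples above (iostat sectors are 512 bytes)
 * into MB/s and compare with the assumed practical Ultra320 limit. */
int
main(void)
{
    const double sector_bytes  = 512.0;
    const double sdd_peak_wsec = 10505.18;  /* WAL disk, busiest sample  */
    const double sde_peak_wsec = 20289.60;  /* data disk, busiest sample */
    const double bus_limit_mb  = 256.0;     /* practical Ultra320 figure */

    double sdd_mb = sdd_peak_wsec * sector_bytes / (1024.0 * 1024.0);
    double sde_mb = sde_peak_wsec * sector_bytes / (1024.0 * 1024.0);

    printf("sdd peak %.1f MB/s, sde peak %.1f MB/s, sum %.1f MB/s "
           "(bus limit ~%.0f MB/s)\n",
           sdd_mb, sde_mb, sdd_mb + sde_mb, bus_limit_mb);
    return 0;
}

This prints roughly 5 MB/s for sdd and 10 MB/s for sde, so the bus
itself should have plenty of headroom.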

I suspect that fsync() is the criminal, as I have feared. Let me show
you some interesting data: I collected stack traces of the backend
processes while a checkpoint was in progress.

[bgwriter]
Oh, he is fsync()ing hard.

#0 0x0000003a629bfbb2 in __fsync_nocancel () from /lib64/tls/libc.so.6
#1 0x00000000005742a1 in mdsync ()
#2 0x00000000005753d7 in smgrsync ()
#3 0x0000000000564d65 in FlushBufferPool ()
...
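
To illustrate why the bgwriter sits in fsync() for so long: during the
checkpoint, after the dirty buffers have been write()n, mdsync()
fsync()s every segment file touched since the last checkpoint, one
after another. Here is a rough sketch of the shape of that path (not
the actual md.c source; PendingSync and checkpoint_sync_phase are
illustrative names of mine):

#define _POSIX_C_SOURCE 200809L
#include <unistd.h>

/* Illustrative only: one entry per segment file that received writes
 * since the last checkpoint. */
typedef struct PendingSync
{
    int                 fd;     /* open file descriptor of the segment */
    struct PendingSync *next;
} PendingSync;

/* The sync phase of a checkpoint, reduced to its essence: fsync()
 * every touched segment in one tight loop.  Each call can block for
 * a long time if the kernel has accumulated many dirty pages for
 * that file, and during that time the data disk sits at 100% %util. */
void
checkpoint_sync_phase(PendingSync *pending)
{
    for (PendingSync *p = pending; p != NULL; p = p->next)
        (void) fsync(p->fd);
}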

[some backends]
They are forced to wait for some lock pertaining to WAL when they try
to insert a log record.

#0 0x0000003a629c7b79 in semop () from /lib64/tls/libc.so.6
#1 0x000000000054c9c6 in PGSemaphoreLock ()
#2 0x0000000000572374 in LWLockAcquire ()
#3 0x000000000046ab10 in XLogInsert ()

[some backends]
They are waiting for some lock when they commit.

#0 0x0000003a629c7b79 in semop () from /lib64/tls/libc.so.6
#1 0x000000000054c9c6 in PGSemaphoreLock ()
#2 0x0000000000572374 in LWLockAcquire ()
#3 0x000000000046a746 in XLogFlush ()
#4 0x000000000046228e in RecordTransactionCommit ()

[the second criminal backend]
He is making the other backends wait for WALWriteLock, because he is
performing a long-running operation - fdatasync() - while holding
WALWriteLock.
But I think he is also a victim: he is probably being slowed down by
the bgwriter's fsync() activity.

#0 0x0000003a629bfc59 in fdatasync () from /lib64/tls/libc.so.6
#1 0x0000000000469ac0 in issue_xlog_fsync ()
#2 0x000000000046a210 in XLogWrite ()
#3 0x000000000046a7df in XLogFlush ()
#4 0x000000000046228e in RecordTransactionCommit ()
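
As a toy model of what the traces add up to (this is not PostgreSQL
source; WALWriteLock, XLogFlush() and issue_xlog_fsync() are the real
names from the traces, everything else is a simplified stand-in), the
commit path boils down to "fdatasync() while holding the lock", so one
slow sync serializes every other committer:

#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t wal_write_lock = PTHREAD_MUTEX_INITIALIZER;
static int wal_fd;

/* Stand-in for RecordTransactionCommit() -> XLogFlush() -> XLogWrite(). */
static void *
commit_transaction(void *arg)
{
    long backend = (long) arg;

    pthread_mutex_lock(&wal_write_lock);    /* LWLockAcquire(WALWriteLock) */
    /* write() of the WAL buffers would happen here ... */
    (void) fdatasync(wal_fd);               /* issue_xlog_fsync(): slow while
                                             * the checkpoint keeps the kernel
                                             * and the controller busy */
    pthread_mutex_unlock(&wal_write_lock);  /* LWLockRelease(WALWriteLock) */

    printf("backend %ld committed\n", backend);
    return NULL;
}

int
main(void)
{
    pthread_t backends[4];

    wal_fd = open("walfile.tmp", O_CREAT | O_WRONLY, 0600);
    for (long i = 0; i < 4; i++)
        pthread_create(&backends[i], NULL, commit_transaction, (void *) i);
    for (int i = 0; i < 4; i++)
        pthread_join(backends[i], NULL);
    close(wal_fd);
    return 0;
}

While the checkpoint's fsync()s keep the kernel and the controller
busy, that fdatasync() becomes slow, and the semop() waits shown above
are simply the queue that forms behind it.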

The following stack trace sometimes appears as the second criminal
instead of the above:

#0 0x0000003a629bfbb2 in __fsync_nocancel () from /lib64/tls/libc.so.6
#1 0x000000000046530e in UpdateControlFile ()
#2 0x000000000046a05d in XLogWrite ()
#3 0x000000000046a7df in XLogFlush ()
#4 0x000000000046228e in RecordTransactionCommit ()
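
My reading of this path (again only a sketch, not the real xlog.c; the
helper name and the file handling below are simplified): when
XLogWrite() moves past a WAL segment it updates pg_control, and
UpdateControlFile() rewrites and fsync()s that small file, so yet
another fsync() is issued while WALWriteLock is held:

#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Sketch of UpdateControlFile(): rewrite pg_control and fsync() it.
 * The real function keeps more state and does error handling; the
 * point is only that another fsync() happens inside XLogWrite(). */
void
update_control_file_sketch(const void *control_data, size_t len)
{
    int fd = open("global/pg_control", O_WRONLY);

    if (fd < 0)
        return;

    (void) write(fd, control_data, len);
    (void) fsync(fd);       /* the __fsync_nocancel frame in the trace */
    close(fd);
}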

[Conclusion]
I believe that the problem cannot really be solved as long as we rely
on fsync()/fdatasync(). We can't ignore what commercial databases have
done so far. The kernel does as much as it likes when PostgreSQL
requests an fsync().
One remaining question is the disk utilization. While the bgwriter is
fsync()ing, the %util of the WAL disk drops to almost 0, yet the
bandwidth of the Ultra320 SCSI bus does not appear to be fully used.
Any ideas?
So what should we do next?
