Re: Load distributed checkpoint V3

From: "Takayuki Tsunakawa" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load distributed checkpoint V3
Date: 2007-04-06 03:20:30
Message-ID: 00ce01c777fa$877b1bb0$19527c0a@OPERAO
Lists: pgsql-hackers pgsql-patches

Hello, long time no see.

I'm sorry to interrupt your discussion. I'm afraid the code is getting
more complicated in order to keep using fsync(). Though I don't mean
to say the current approach is wrong, could anyone evaluate again the
O_SYNC approach that commercial databases use, and tell me whether and
why PostgreSQL's fsync() approach is better than theirs?

This January I got a good result with O_SYNC, which I haven't
reported here yet. I'll show it briefly; please forgive my abrupt
email, as I don't have much time.
# Personally, I'd like to work in the community, if I'm allowed.
And sorry again: last year I reported that O_SYNC resulted in very
bad performance, but that report was wrong. The PC server I had
borrowed was configured so that all of its disks formed one RAID5
device, so the disks for data and WAL (/dev/sdd and /dev/sde) came
from the same RAID5 device and their I/O contended with each other.

All I modified was md.c: I added O_SYNC to the open flags in mdopen()
and _mdfd_openseg() when am_bgwriter is true. I didn't want backends
to use O_SYNC, because mdextend() does not have to transfer data to
disk.
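
For reference, the change was essentially the following (a minimal
sketch against 8.2-era md.c, not the literal patch; the real functions
pass more flags and handle errors, and am_bgwriter would have to be
made visible outside bgwriter.c):

    /* In mdopen() and _mdfd_openseg(): open data files with O_SYNC
     * when running inside the bgwriter, so every write() reaches disk
     * synchronously and no fsync() is needed at checkpoint time. */
    int     flags = O_RDWR | PG_BINARY;

    if (am_bgwriter)        /* true only in the background writer */
        flags |= O_SYNC;    /* synchronous writes for bgwriter only */

    fd = PathNameOpenFile(path, flags, 0600);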

My evaluation environment was:

CPU: Intel Xeon 3.2GHz * 2 (HT on)
Memory: 4GB
Disk: Ultra320 SCSI (probably configured with write-back cache)
OS: RHEL3.0 Update 6
Kernel: 2.4.21-37.ELsmp
PostgreSQL: 8.2.1

The relevant settings of PostgreSQL are:

shared_buffers = 2GB
wal_buffers = 1MB
wal_sync_method = open_sync
checkpoint_* and bgwriter_* parameters were left at their defaults.

I used pgbench with data at scaling factor 50.

[without O_SYNC, original behavior]
- pgbench -c1 -t16000
best response: 1ms
worst response: 6314ms
10th worst response: 427ms
tps: 318
- pgbench -c32 -t500
best response: 1ms
worst response: 8690ms
10th worst response: 8668ms
tps: 330

[with O_SYNC]
- pgbench -c1 -t16000
best response: 1ms
worst response: 350ms
10th worst response: 91ms
tps: 427
- pgbench -c32 -t500
best response: 1ms
worst response: 496ms
10th worst response: 435ms
tps: 1117

If the write-back cache were disabled, the difference would be
smaller. The Windows version showed similar improvements.

However, this approach has two big problems.

(1) Slows down bulk updates

Updates of large amounts of data get much slower, because the bgwriter
seeks and writes dirty buffers synchronously, page by page. For example:

- COPY of accounts (5m records) and CHECKPOINT command after COPY
without O_SYNC: 100sec
with O_SYNC: 1046sec
- UPDATE of all records of accounts
without O_SYNC: 139sec
with O_SYNC: 639sec
- CHECKPOINT command for flushing 1.6GB of dirty buffers
without O_SYNC: 24sec
with O_SYNC: 126sec

To mitigate this problem, I sorted the dirty buffers by relfilenode
and block number and wrote multiple pages that were adjacent both in
memory and on disk (a sketch of this sorting follows the results
below). The result was:

- COPY of accounts (5m records) and CHECKPOINT command after COPY
with O_SYNC + sorting: 227sec
- UPDATE of all records of accounts
with O_SYNC + sorting: 569sec
- CHECKPOINT command for flushing 1.6GB of dirty buffers
with O_SYNC + sorting: 71sec

Still bad...
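
For reference, the sorting and coalescing were roughly like this (a
standalone sketch, not the actual patch; write_run() is a hypothetical
helper that issues one synchronous write per run, and the comparison
assumes RelFileNode has no padding bytes):

    /* Sort the checkpoint's dirty buffers so that blocks which are
     * consecutive on disk can be written together. */
    typedef struct DirtyBuf
    {
        RelFileNode rnode;      /* which relation file */
        BlockNumber blocknum;   /* block within that relation */
        int         buf_id;     /* shared buffer holding the page */
    } DirtyBuf;

    static int
    dirtybuf_cmp(const void *a, const void *b)
    {
        const DirtyBuf *da = (const DirtyBuf *) a;
        const DirtyBuf *db = (const DirtyBuf *) b;
        int         r = memcmp(&da->rnode, &db->rnode, sizeof(RelFileNode));

        if (r != 0)
            return r;
        if (da->blocknum != db->blocknum)
            return (da->blocknum < db->blocknum) ? -1 : 1;
        return 0;
    }

    DirtyBuf   *bufs;           /* dirty buffers, collected elsewhere */
    int         n, i, j;

    qsort(bufs, n, sizeof(DirtyBuf), dirtybuf_cmp);

    for (i = 0; i < n; i = j)
    {
        /* extend the run while blocks stay consecutive on disk */
        for (j = i + 1; j < n; j++)
            if (memcmp(&bufs[j].rnode, &bufs[i].rnode,
                       sizeof(RelFileNode)) != 0 ||
                bufs[j].blocknum != bufs[j - 1].blocknum + 1)
                break;

        /* pages that are also adjacent in shared memory can go out in
         * a single write(); otherwise fall back to page-by-page */
        write_run(bufs + i, j - i);     /* hypothetical helper */
    }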

(2) Can't utilize tablespaces

Though I didn't evaluate it, update activity would be much less
efficient with O_SYNC than with fsync() when multiple tablespaces are
used, because there is only one bgwriter.

Can anyone solve these problems?
One of my ideas is to use scattered I/O. I hear that readv()/writev()
have been able to do real scattered I/O since kernel 2.6 (RHEL4.0);
with earlier kernels, readv()/writev() just performed the I/Os
sequentially. Windows has provided reliable scattered I/O for years.
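
Roughly, the bgwriter could then push pages that are consecutive on
disk but scattered in shared buffers with one system call (a sketch,
assuming the file was opened with O_SYNC and a 2.6 kernel whose
writev() really submits the vector as one I/O; page_addr() is a
hypothetical accessor for a buffer's page):

    #include <sys/uio.h>

    int          fd;            /* data file, opened with O_SYNC */
    int          nrun;          /* length of the sorted run found above */
    BlockNumber  first_block;   /* first disk block of the run */
    struct iovec iov[16];       /* one entry per page in the run */
    int          cnt;

    for (cnt = 0; cnt < nrun && cnt < 16; cnt++)
    {
        iov[cnt].iov_base = page_addr(bufs[cnt].buf_id);
        iov[cnt].iov_len  = BLCKSZ;     /* 8192 bytes by default */
    }

    /* seek to the first block of the run, then write the whole run */
    if (lseek(fd, (off_t) first_block * BLCKSZ, SEEK_SET) < 0)
        elog(ERROR, "seek failed");
    if (writev(fd, iov, cnt) != (ssize_t) cnt * BLCKSZ)
        elog(ERROR, "scattered write failed");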

Another idea is to use async I/O, possibly combined with a
multiple-bgwriter approach on platforms where async I/O is not
available. How about the chance Josh-san has brought?
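
As one possibility, POSIX AIO would let a single bgwriter keep writes
to several tablespaces in flight at once (a sketch using
aio_write()/aio_suspend(); nio, fds[], pages[], and blocks[] are
hypothetical arrays filled in by the caller, and error handling is
abridged):

    #include <aio.h>
    #include <string.h>

    struct aiocb cbs[8];
    int          nio;           /* number of pages to flush, <= 8 */
    int          i;

    /* issue all writes without waiting for each one */
    for (i = 0; i < nio; i++)
    {
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_fildes = fds[i];             /* per-tablespace fd */
        cbs[i].aio_buf    = pages[i];           /* page to flush */
        cbs[i].aio_nbytes = BLCKSZ;
        cbs[i].aio_offset = (off_t) blocks[i] * BLCKSZ;
        aio_write(&cbs[i]);                     /* returns immediately */
    }

    /* then wait for each write to complete and check its result */
    for (i = 0; i < nio; i++)
    {
        const struct aiocb *one[1] = { &cbs[i] };

        while (aio_error(&cbs[i]) == EINPROGRESS)
            aio_suspend(one, 1, NULL);          /* block until progress */
        if (aio_return(&cbs[i]) != BLCKSZ)
            elog(ERROR, "async write failed");
    }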
