Direct I/O issues

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Direct I/O issues
Date: 2006-11-23 06:30:24
Message-ID: Pine.GSO.4.64.0611230013550.26031@westnet.com
Lists: pgsql-hackers pgsql-patches pgsql-performance

I've been trying to optimize a Linux system where benchmarking suggests
large performance differences between the various wal_sync_method options
(with o_sync being the big winner). I started that by using
src/tools/fsync/test_fsync to get an idea what I was dealing with (and to
spot which drives had write caching turned on). Since those results
didn't match what I was seeing in the benchmarks, I've been browsing the
backend source to figure out why. I noticed test_fsync appears to be,
ahem, out of sync with what the engine is doing.

It looks like V8.1 introduced O_DIRECT writes to the WAL, determined at
compile time by a series of preprocessor tests in
src/backend/access/transam/xlog.c. When O_DIRECT is available,
O_SYNC/O_FSYNC/O_DSYNC writes use it. test_fsync doesn't do that.
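
The compile-time selection described above can be sketched roughly as
follows. This is a simplified illustration, not a verbatim copy of xlog.c:
the macro names (MY_O_DIRECT, MY_BARE_SYNC_FLAG, MY_OPEN_SYNC_FLAG) and the
helper wal_open_flags() are made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc only exposes O_DIRECT with this set */
#endif
#include <fcntl.h>

/* Use O_DIRECT only where the platform headers define it. */
#ifdef O_DIRECT
#define MY_O_DIRECT O_DIRECT
#else
#define MY_O_DIRECT 0
#endif

/* Pick whichever synchronous-open flag exists, if any. */
#if defined(O_SYNC)
#define MY_BARE_SYNC_FLAG O_SYNC
#elif defined(O_FSYNC)
#define MY_BARE_SYNC_FLAG O_FSYNC
#endif

/* Synchronous opens also get O_DIRECT when it is available. */
#ifdef MY_BARE_SYNC_FLAG
#define MY_OPEN_SYNC_FLAG (MY_BARE_SYNC_FLAG | MY_O_DIRECT)
#else
#define MY_OPEN_SYNC_FLAG 0
#endif

/* Hypothetical helper: flags for opening a WAL-like file for sync writes. */
static int wal_open_flags(void)
{
    return O_RDWR | MY_OPEN_SYNC_FLAG;
}
```

A test program that opens its file with plain O_SYNC, rather than a
combined flag like MY_OPEN_SYNC_FLAG, would be timing a different code path
from the one the backend takes.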

I moved the new code (in 8.2 beta 3, lines 61-92 in xlog.c) into
test_fsync; all the flags had the same name so it dropped right in. You
can get the version I made at http://www.westnet.com/~gsmith/test_fsync.c
(I fixed a compiler warning, too).

The results I get now look fishy. I'm not sure if I screwed up a step, or
if I'm seeing a real problem. The system here is running Red Hat Linux
(RHEL ES 4.0, kernel 2.6.9), and the disk I'm writing to is a standard
7200RPM IDE drive. I turned off write caching with hdparm -W 0.

Here's an excerpt from the stock test_fsync:

Compare one o_sync write to two:
    one 16k o_sync write     8.717944
    two 8k o_sync writes    17.501980

Compare file sync methods with 2 8k writes:
    (o_dsync unavailable)
    open o_sync, write      17.018495
    write, fdatasync         8.842473
    write, fsync,            8.809117

And here's the version I tried to modify to include O_DIRECT support:

Compare one o_sync write to two:
    one 16k o_sync write     0.004995
    two 8k o_sync writes     0.003027

Compare file sync methods with 2 8k writes:
    (o_dsync unavailable)
    open o_sync, write       0.004978
    write, fdatasync         8.845498
    write, fsync,            8.834037
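
As a rough plausibility check on the two sets of numbers above, the time
per write can be compared with the time a 7200RPM platter needs for one
revolution. The sketch below assumes each reported figure is the elapsed
time in seconds for a loop of 1000 iterations; that loop count is an
assumption about test_fsync, not something stated in the output. The
function names are made up for this example.

```c
/* Time for one full platter revolution, in milliseconds. */
static double rotation_ms(double rpm)
{
    return 60.0 * 1000.0 / rpm;
}

/* Average time per loop iteration, in milliseconds. */
static double per_iteration_ms(double total_seconds, int iterations)
{
    return total_seconds * 1000.0 / iterations;
}
```

Under that assumption, rotation_ms(7200.0) is about 8.33 ms and
per_iteration_ms(8.717944, 1000) is about 8.72 ms, so the stock results are
consistent with roughly one revolution per synchronous write. The 0.004995
figure would work out to about 0.005 ms per write, far less than one
revolution.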

Obviously the o_sync writes aren't waiting for the disk. Is this a
problem with O_DIRECT under Linux? Or is my code just not correctly
testing this behavior?
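
One way to narrow down the second possibility: on Linux, O_DIRECT
generally requires the buffer address, transfer size, and file offset to
be suitably aligned, and a write() that violates this fails (typically
with EINVAL) instead of reaching the disk. A timing loop that ignores the
return value of write() would then report very small times. A minimal
sketch, assuming a 4096-byte alignment is sufficient (the real requirement
depends on the kernel and filesystem); alloc_aligned() and checked_write()
are hypothetical helper names, not functions from test_fsync:

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_ALIGN 4096        /* assumed alignment; varies by system */

/* Allocate a zeroed buffer whose address is aligned to BLOCK_ALIGN. */
static void *alloc_aligned(size_t len)
{
    void *buf = NULL;

    if (posix_memalign(&buf, BLOCK_ALIGN, len) != 0)
        return NULL;
    memset(buf, 0, len);
    return buf;
}

/*
 * Write len bytes and report failure instead of ignoring it.
 * Returns 0 on success, or an errno value (EIO for a short write).
 */
static int checked_write(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);

    if (n < 0)
        return errno;
    if ((size_t) n != len)
        return EIO;
    return 0;
}
```

If every write in the timing loop goes through something like
checked_write() and still returns 0 in a few microseconds, the code is
probably doing what it claims and the question shifts to the kernel or
the drive; if it returns an error, the fast times are an artifact of the
test.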

Just as a sanity check, I did try this on another system, running SuSE
with drives connected to a cciss SCSI device, and I got exactly the same
results. I'm concerned that Linux users who use O_SYNC because they
notice it's faster will be losing their WAL integrity without being aware
of the problem, especially as the whole O_DIRECT business isn't even
mentioned in the WAL documentation--it really deserves to be brought up in
the wal_sync_method notes at
http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html

And while I'm mentioning improvements to that particular documentation
page...the wal_buffers notes there are so sparse that they misled me at
first. They suggest only bumping it up for situations with very large
transactions; since I was testing with small ones, I left it woefully
undersized. I would suggest copying the text from
http://developer.postgresql.org/pgdocs/postgres/wal-configuration.html to
here: "When full_page_writes is set and the system is very busy, setting
this value higher will help smooth response times during the period
immediately following each checkpoint." That seems to match what I found
in testing.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD
