Re: O_DIRECT for WAL writes

From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: O_DIRECT for WAL writes
Date: 2005-05-30 08:04:48
Message-ID: 429AC920.6080809@cheapcomplexdevices.com
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Neil Conway <neilc(at)samurai(dot)com> writes:
>>is opening a file with O_DIRECT sufficient to ensure that
>>a write(2) does not return until the data has hit disk?
>
> Some googling suggests so, eg
> http://www.die.net/doc/linux/man/man2/open.2.html

Really? On that page I read:
"O_DIRECT...at the completion of the read(2) or write(2)
system call, data is guaranteed to have been transferred."
which sounds to me like the data is transferred to the device's
cache but not necessarily flushed through the device's cache.
It says nothing about physical media. That wording feels
different to me from O_SYNC which reads:
"O_SYNC will block the calling process until the data has
been physically written to the underlying hardware."
which does suggest to me that it writes to physical media.
Or am I reading that wrong?
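
To make the comparison concrete, here is a minimal sketch of the two
approaches whose semantics I'm more confident about: opening with O_SYNC,
versus a plain write() followed by fsync(). durable_write() and its
use_o_sync parameter are hypothetical names for illustration only, not
anything from the PostgreSQL source. I've deliberately left O_DIRECT out,
since it needs aligned buffers and the man page wording above only promises
that data has been "transferred". Whether either variant reaches physical
media still depends on the kernel, filesystem and drive write cache.

```c
/* Sketch only: contrasts O_SYNC with write() followed by fsync().
 * durable_write() is a hypothetical helper, not PostgreSQL code. */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes from buf to path (created or truncated).
 * If use_o_sync is nonzero the file is opened with O_SYNC, so each write()
 * should block until the kernel considers the data written; otherwise a
 * plain write() is followed by an explicit fsync().
 * Returns 0 on success, -1 on failure. */
int durable_write(const char *path, const void *buf, size_t len, int use_o_sync)
{
    assert(path != NULL && buf != NULL);

    int flags = O_WRONLY | O_CREAT | O_TRUNC;
    if (use_o_sync)
        flags |= O_SYNC;

    int fd = open(path, flags, 0600);
    if (fd < 0)
        return -1;

    /* write() may be short or interrupted, so loop until done. */
    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            return -1;
        }
        p += n;
        left -= (size_t) n;
    }

    /* Without O_SYNC, ask the kernel to flush explicitly. */
    if (!use_o_sync && fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Calling durable_write(path, buf, len, 1) exercises the O_SYNC path and
durable_write(path, buf, len, 0) the fsync() path; timing the two on a
given box would be one way to see whether they behave alike there.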

PS: I've gotten way out of my depth here, but...

...attempting to browse the Linux source(!!)

Looking at the O_SYNC stuff in ext3:
http://lxr.linux.no/source/fs/ext3/file.c#L67
it looks like in this conditional:
    if (file->f_flags & O_SYNC) {
        ...
        goto force_commit;
    }
the goto branch calls ext3_force_commit() in much the
same way that it seems fsync() does here:
http://lxr.linux.no/source/fs/ext3/fsync.c#L71
so I believe O_SYNC does at least as much as fsync().

However I can't find O_DIRECT anywhere in the ext3 stuff,
so if it does work there, it's less obvious to me how.

Moreover I see O_SYNC used in lots of places:
http://lxr.linux.no/ident?i=O_SYNC
including fs/ext3/; and I don't see O_DIRECT
in nearly as many places:
http://lxr.linux.no/ident?i=O_DIRECT
It looks like reiserfs and xfs look at O_DIRECT,
but ext3 doesn't appear to, unless it's somewhere
outside the fs/ext3 directory.

PPS: Of course not even fsync() flushed correctly until very recent kernels:
http://hardware.slashdot.org/comments.pl?sid=149349&cid=12519114
In that article Jeff Garzik (the Linux SATA driver guy) suggests
that until very recent kernels ext3 did not have write barrier
support that issues the FLUSH CACHE (IDE) or SYNCHRONIZE CACHE
(SCSI) commands even on fsync.

PPPS: No, I don't understand the kernel - I'm just showing what quick
grep commands showed without any deep understanding.
