On Fri, 2005-06-03 at 10:37 +1000, Neil Conway wrote:
> On Thu, 2005-06-02 at 11:49 -0700, Mary Edie Meredith wrote:
> > My understanding is that O_DIRECT means "direct" as in "no buffering by
> > the OS" which implies that if you write from your buffer, the write is
> > not going to return unless the OS thinks the write is completed
> Right, I think that's definitely the case. The question is whether a
> write() under O_DIRECT will also flush the disk's write cache -- i.e.
> when the write() completes, we need it to be durable over a spontaneous
> power loss. fsync() or O_SYNC should provide this (modulo braindamaged
> IDE hardware), but I wouldn't be surprised if O_DIRECT by itself will
> not (otherwise you would hurt the performance of applications using
> O_DIRECT that don't need these durability guarantees).
My understanding is that on Linux, with respect to "guaranteed writes",
a write on an fd opened with O_DIRECT behaves the _same_ as a
write/fsync on an fd opened without O_DIRECT; i.e., whether the write
completes all the way to the disk itself depends on when the particular
device responds to those equivalent sequences.
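To make the two sequences concrete, here is a minimal C sketch for Linux. The function names, the 4096-byte alignment and the zero padding are assumptions made for the example and do not come from PostgreSQL. Note also that the open(2) man page says O_DIRECT on its own does not give the guarantees of O_SYNC, so the equivalence is worth verifying on the kernel and filesystem in question.

```c
#define _GNU_SOURCE          /* exposes O_DIRECT on Linux/glibc */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 4096           /* assumed alignment; real devices may differ */

/* Sequence A: ordinary buffered write followed by fsync().
 * Returns 0 on success, -1 on any failure. */
int write_then_fsync(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    int ok = write(fd, buf, len) == (ssize_t) len && fsync(fd) == 0;
    close(fd);
    return ok ? 0 : -1;
}

/* Sequence B: a single write on an fd opened with O_DIRECT.
 * O_DIRECT needs an aligned buffer and an aligned length, so the data
 * is copied into an aligned buffer and zero-padded to a BLOCK multiple.
 * Returns 0 on success, -1 on failure (for example when the filesystem
 * does not support O_DIRECT and open() fails). */
int write_direct(const char *path, const void *buf, size_t len)
{
    void *aligned;
    size_t padded = (len + BLOCK - 1) / BLOCK * BLOCK;

    if (posix_memalign(&aligned, BLOCK, padded) != 0)
        return -1;
    memset(aligned, 0, padded);
    memcpy(aligned, buf, len);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0600);
    if (fd < 0) {
        free(aligned);
        return -1;
    }
    int ok = write(fd, aligned, padded) == (ssize_t) padded;
    close(fd);
    free(aligned);
    return ok ? 0 : -1;
}
```

In both cases a successful return only says that the device acknowledged the write; whether the data survives a power loss still depends on the drive's own write cache.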
Quoting from the Capabilities Document: "'Guarantee a write completion'
means the operating system has issued a write to the I/O subsystem, and
the device has returned an affirmative response. Once an affirmative
response is sent, recovery from power down without data loss is the
responsibility of the I/O subsystem." Don't most disk drives have a
battery backup so that they can flush their caches if power is lost? Ditto
for disk arrays with fancier caches and write-back turned on (not advised
for the paranoid).
Looking at this from another angle, is there really any way that you can
say a write is truly guaranteed in the event of a failure? I think in
the end, to be safe, you have to assume you cannot. That's why (and I'm
not telling you anything new) there is no substitute for backups and log
archiving for databases. Databases must be able to recognize the last
_good_ transaction logged and roll forward to that from the backup
(including detecting partial writes to the log). I'm sure the PostgreSQL
community
has worked hard to do the equivalent of that within the PostgreSQL
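For readers unfamiliar with how partial log writes can be detected, here is a minimal C sketch of one common approach: each record carries its length and a checksum, and recovery scans forward until the first record that is cut short or fails its checksum. The record layout and the names `rec_header`, `log_append` and `log_last_good` are hypothetical and are not PostgreSQL's actual WAL format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record layout: header, then `len` payload bytes. */
struct rec_header {
    uint32_t len;   /* payload length in bytes */
    uint32_t crc;   /* CRC-32 of the payload */
};

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). */
uint32_t crc32_bytes(const unsigned char *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Append one record at offset `off`; the caller guarantees there is room.
 * Returns the offset just past the new record. */
size_t log_append(unsigned char *buf, size_t off,
                  const void *payload, uint32_t len)
{
    struct rec_header h = { len, crc32_bytes(payload, len) };
    memcpy(buf + off, &h, sizeof h);
    memcpy(buf + off + sizeof h, payload, len);
    return off + sizeof h + len;
}

/* Scan from the start and return the offset just past the last record
 * that is complete and passes its checksum. Everything after that point
 * is treated as a torn or never-written tail. */
size_t log_last_good(const unsigned char *buf, size_t size)
{
    size_t off = 0;
    for (;;) {
        struct rec_header h;
        if (size - off < sizeof h)
            break;                              /* header cut short */
        memcpy(&h, buf + off, sizeof h);
        if (h.len == 0 || h.len > size - off - sizeof h)
            break;                              /* empty tail or payload cut short */
        if (crc32_bytes(buf + off + sizeof h, h.len) != h.crc)
            break;                              /* payload damaged */
        off += sizeof h + h.len;
    }
    return off;
}
```

Recovery would then replay records up to the returned offset and discard the rest, which is the "roll forward to the last good transaction" step described above.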
> > Bottom line: if you do not implement direct/async IO so that you
> > optimize caching of hot database objects and minimize memory utilization
> > of objects used once, you are probably leaving performance on the table
> > for datafiles.
> Absolutely -- patches are welcome :)
How about testing patches (--:
> I agree async IO + O_DIRECT in some
> form would be interesting, but the changes required are far from trivial
> -- my guess is there are lower hanging fruit.
Since the log has to be sequential, I think you are on the right track!
Believe me, I didn't mean to imply that it is trivial to implement. For
those databases that have async/direct, the functionality appeared over
a span of several major versions. I just thought I detected an opinion
that it would not help. Sorry for the misunderstanding. I absolutely
don't mean to sound critical. At OSDL we have the greatest respect for
the PostgreSQL community.
Mary Edie Meredith
Data Center Linux Initiative Manager
Open Source Development Labs