
Re: O_DIRECT for WAL writes

From: Mary Edie Meredith <maryedie(at)osdl(dot)org>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>,pgsql-patches(at)postgresql(dot)org
Subject: Re: O_DIRECT for WAL writes
Date: 2005-06-02 00:08:14
Message-ID: 1117670894.2922.339.camel@localhost
Lists: pgsql-hackers, pgsql-patches
On Mon, 2005-05-30 at 16:29 +1000, Neil Conway wrote:
> On Mon, 2005-05-30 at 10:59 +0900, ITAGAKI Takahiro wrote:
> > Yes, I've tested pgbench and dbt2 and their performances have improved.
> > The two results are as follows:
> > 
> > 1. pgbench -s 100 on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8
> >    (attached image)
> >   tps  | wal_sync_method
> > -------+-------------------------------------------------------
> >  147.0 | open_direct + write multipage (previous patch)
> >  147.2 | open_direct (this patch)
> >  109.9 | open_sync
> I'm surprised this makes as much of a difference as that benchmark would
> suggest. I wonder if we're benchmarking the right thing, though: is
> opening a file with O_DIRECT sufficient to ensure that a write(2) does
> not return until the data has hit disk? (As would be the case with
> O_SYNC.) O_DIRECT means the OS will attempt to minimize caching, but
> that is not necessarily the same thing: for example, I can imagine an
> implementation in which the kernel would submit the appropriate I/O to
> the disk when it sees a write(2) on a file opened with O_DIRECT, but
> then let the write(2) return before getting confirmation from the disk
> that the I/O has succeeded or failed. From googling, the MySQL
> documentation for innodb_flush_method notes:
>         This option is only relevant on Unix systems. If set to
>         fdatasync, InnoDB uses fsync() to flush both the data and log
>         files. If set to O_DSYNC, InnoDB uses O_SYNC to open and flush
>         the log files, but uses fsync() to flush the datafiles. If
>         O_DIRECT is specified (available on some GNU/Linux versions
>         starting from MySQL 4.0.14), InnoDB uses O_DIRECT to open the
>         datafiles, and uses fsync() to flush both the data and log
>         files.
> That would suggest O_DIRECT by itself is not sufficient to force a flush
> to disk -- if anyone has some more definitive evidence that would be
> welcome.
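
A minimal C sketch of the combination discussed above: open the file with O_DIRECT to bypass the page cache, then call fdatasync() explicitly rather than assuming the write(2) alone reached the disk. This is an illustration, not code from the patch. The function name write_block_durably, the SKETCH_ALIGN constant (same 4096 value as the patch's define), and the fallback to a plain open() when the filesystem rejects O_DIRECT are all assumptions made for the example.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* assumption: degrade to buffered I/O if absent */
#endif

#define SKETCH_ALIGN 4096       /* assumed alignment, same value as the patch */

/*
 * Copy up to SKETCH_ALIGN bytes into an aligned, zero-padded buffer, write
 * the whole block, and force it to stable storage with fdatasync().
 * Returns 0 on success, -1 on any failure.
 */
int
write_block_durably(const char *path, const void *data, size_t len)
{
    void   *buf = NULL;
    int     fd;
    int     rc = -1;

    assert(len <= SKETCH_ALIGN);

    /* O_DIRECT requires the user buffer to be suitably aligned. */
    if (posix_memalign(&buf, SKETCH_ALIGN, SKETCH_ALIGN) != 0)
        return -1;
    memset(buf, 0, SKETCH_ALIGN);
    memcpy(buf, data, len);

    fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0 && errno == EINVAL)
        fd = open(path, O_WRONLY | O_CREAT, 0600);  /* fs rejected O_DIRECT */

    if (fd >= 0)
    {
        /* O_DIRECT minimizes caching; fdatasync() is what requests the flush. */
        if (write(fd, buf, SKETCH_ALIGN) == (ssize_t) SKETCH_ALIGN &&
            fdatasync(fd) == 0)
            rc = 0;
        if (close(fd) != 0)
            rc = -1;
    }

    free(buf);
    return rc;
}
```

Whether the explicit fdatasync() is strictly required on a given kernel and filesystem is exactly the open question in this thread; the sketch simply takes the conservative route.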

I know I'm late to this discussion, and I haven't made it all the way
through this thread to see if your questions on Linux writes were
resolved. If you are still interested, I recommend reading a very good
one-page description of reliable writes buried in the Data Center Linux
Goals and Capabilities document. It is on page 159 of the document; the
item is "R.ReliableWrites" in this giant PDF file (do a wget and open
it locally; don't try to read it directly):

The information came from my interview with Daniel McNeil, an OSDL
engineer who wrote and tested much of the Linux async I/O code, after I
was similarly confused about when a write is "guaranteed". Reliable
writes, as you can imagine, are very important to Data Center folks,
which is how the topic ended up in this document.

Hope this helps.
> Anyway, if the above is true, we'll need to use O_DIRECT as well as one
> of the existing wal_sync_methods.
> BTW, from the patch:
> + /* TODO: Aligment depends on OS and filesystem. */
> + #define O_DIRECT_BUFFER_ALIGN	4096
> I suppose there's no reasonable way to autodetect this, so we'll need to
> expose it as a GUC variable (or perhaps a configure option), which is a
> bit unfortunate.
> -Neil
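
For illustration only, here is one way a buffer alignment could be guessed at run time instead of being hard-coded. The sketch asks POSIX pathconf() for _PC_REC_XFER_ALIGN (the recommended transfer buffer alignment) and falls back to 4096 when no usable answer comes back. That value is advisory, and nothing in this thread confirms that it matches what O_DIRECT actually demands on a given OS and filesystem, so treat it as a hint. The names guess_direct_io_align, alloc_aligned_buffer, and FALLBACK_ALIGN are hypothetical and do not come from the patch.

```c
#define _POSIX_C_SOURCE 200809L /* for posix_memalign() and pathconf() */
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define FALLBACK_ALIGN 4096     /* assumed default, same value as the patch */

/*
 * Ask the system for a recommended transfer alignment for 'path'.
 * Falls back to FALLBACK_ALIGN if the query is unsupported or the answer
 * is not a power of two that posix_memalign() can accept.
 */
size_t
guess_direct_io_align(const char *path)
{
    long    align = -1;

#ifdef _PC_REC_XFER_ALIGN
    align = pathconf(path, _PC_REC_XFER_ALIGN);
#endif

    if (align < (long) sizeof(void *) || (align & (align - 1)) != 0)
        return FALLBACK_ALIGN;
    return (size_t) align;
}

/* Allocate 'size' bytes aligned to 'align'; returns NULL on failure. */
void *
alloc_aligned_buffer(size_t align, size_t size)
{
    void   *buf = NULL;

    /* posix_memalign() needs a power of two that is >= sizeof(void *). */
    assert(align >= sizeof(void *) && (align & (align - 1)) == 0);

    if (posix_memalign(&buf, align, size) != 0)
        return NULL;
    return buf;
}
```

A caller would pass the directory that holds the WAL files to guess_direct_io_align() and free the returned buffer with free(). A GUC or configure option, as suggested above, would still be needed as an override wherever the hint is wrong.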
Mary Edie Meredith 
Data Center Linux Initiative Manager
Open Source Development Labs
