Re: O_DIRECT for WAL writes

From: Mary Edie Meredith <maryedie(at)osdl(dot)org>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: O_DIRECT for WAL writes
Date: 2005-06-02 00:08:14
Message-ID: 1117670894.2922.339.camel@localhost
Lists: pgsql-hackers pgsql-patches

On Mon, 2005-05-30 at 16:29 +1000, Neil Conway wrote:
> On Mon, 2005-05-30 at 10:59 +0900, ITAGAKI Takahiro wrote:
> > Yes, I've tested pgbench and dbt2 and their performances have improved.
> > The two results are as follows:
> >
> > 1. pgbench -s 100 on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8
> > (attached image)
> >   tps  | wal_sync_method
> > -------+-------------------------------------------------------
> >  147.0 | open_direct + write multipage (previous patch)
> >  147.2 | open_direct (this patch)
> >  109.9 | open_sync
>
> I'm surprised this makes as much of a difference as that benchmark would
> suggest. I wonder if we're benchmarking the right thing, though: is
> opening a file with O_DIRECT sufficient to ensure that a write(2) does
> not return until the data has hit disk? (As would be the case with
> O_SYNC.) O_DIRECT means the OS will attempt to minimize caching, but
> that is not necessarily the same thing: for example, I can imagine an
> implementation in which the kernel would submit the appropriate I/O to
> the disk when it sees a write(2) on a file opened with O_DIRECT, but
> then let the write(2) return before getting confirmation from the disk
> that the I/O has succeeded or failed. From googling, the MySQL
> documentation for innodb_flush_method notes:
>
> This option is only relevant on Unix systems. If set to
> fdatasync, InnoDB uses fsync() to flush both the data and log
> files. If set to O_DSYNC, InnoDB uses O_SYNC to open and flush
> the log files, but uses fsync() to flush the datafiles. If
> O_DIRECT is specified (available on some GNU/Linux versions
> starting from MySQL 4.0.14), InnoDB uses O_DIRECT to open the
> datafiles, and uses fsync() to flush both the data and log
> files.
>
> That would suggest O_DIRECT by itself is not sufficient to force a flush
> to disk -- if anyone has some more definitive evidence that would be
> welcome.

I know I'm late to this discussion, and I haven't made it all the way
through this thread to see if your questions on Linux writes were
resolved. If you are still interested, I recommend reading a very good
one-page description of reliable writes buried in the Data Center Linux
Goals and Capabilities document. It is on page 159 of the document; the
item is "R.ReliableWrites" in this giant PDF file (do a wget and open
it locally; don't try to read it directly):

http://www.osdlab.org/lab_activities/data_center_linux/DCL_Goals_Capabilities_1.1.pdf

The information came from me interviewing Daniel McNeil, an OSDL
Engineer who wrote and tested much of the Linux async IO code, after I
was similarly confused about when a write is "guaranteed". Reliable
writes, as you can imagine, are very important to Data Center folks,
which is how it happens to be in this document.

Hope this helps.
>
> Anyway, if the above is true, we'll need to use O_DIRECT as well as one
> of the existing wal_sync_methods.
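To make the combination Neil describes concrete, here is a minimal
sketch, not the patch's code: it opens a file with O_DIRECT, writes one
aligned block, and then calls fdatasync() rather than assuming O_DIRECT
alone implies durability. The function name write_block_durably and the
WAL_BLOCK constant are hypothetical; 4096 is taken from the patch's
O_DIRECT_BUFFER_ALIGN and may not be right for every OS or filesystem.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc needs this to expose O_DIRECT */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* no O_DIRECT here: buffered I/O + fdatasync */
#endif

#define WAL_BLOCK 4096          /* assumed alignment, not autodetected */

/*
 * Write up to one block of data at offset 0 of path, zero-padded to
 * WAL_BLOCK bytes, then flush it with fdatasync().
 * Returns 0 on success, -1 on error.
 */
static int
write_block_durably(const char *path, const void *data, size_t len)
{
    void *buf = NULL;
    int   fd;
    int   rc = -1;

    if (len > WAL_BLOCK)
    {
        errno = EINVAL;
        return -1;
    }

    /* O_DIRECT wants the buffer address and the length aligned. */
    if (posix_memalign(&buf, WAL_BLOCK, WAL_BLOCK) != 0)
        return -1;
    memset(buf, 0, WAL_BLOCK);
    memcpy(buf, data, len);

    fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0 && errno == EINVAL)
        fd = open(path, O_WRONLY | O_CREAT, 0600);  /* fs refuses O_DIRECT */

    if (fd >= 0)
    {
        /* O_DIRECT only bypasses the cache; the sync call is what waits. */
        if (write(fd, buf, WAL_BLOCK) == (ssize_t) WAL_BLOCK &&
            fdatasync(fd) == 0)
            rc = 0;
        if (close(fd) != 0)
            rc = -1;
    }

    free(buf);
    return rc;
}
```

Opening with O_DIRECT | O_SYNC would be the open-time variant of the
same idea; which of the two is faster is exactly what the benchmarks
above would have to show.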
>
> BTW, from the patch:
>
> + /* TODO: Alignment depends on OS and filesystem. */
> + #define O_DIRECT_BUFFER_ALIGN 4096
>
> I suppose there's no reasonable way to autodetect this, so we'll need to
> expose it as a GUC variable (or perhaps a configure option), which is a
> bit unfortunate.
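If the alignment did become a run-time setting, the buffer handling
could take it as a parameter instead of a compile-time #define. A rough
sketch under that assumption follows; align_up and alloc_aligned are
hypothetical helper names, not anything in the patch, and the alignment
is assumed to be a power of two.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Round len up to the next multiple of align (a power of two). */
static size_t
align_up(size_t len, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (len + align - 1) & ~(align - 1);
}

/*
 * Allocate a zeroed buffer whose address and size are both multiples
 * of align. The padded size is returned in *padded_len.
 * Returns NULL on failure; release the buffer with free().
 */
static void *
alloc_aligned(size_t len, size_t align, size_t *padded_len)
{
    void  *buf = NULL;
    size_t n = align_up(len, align);

    /* posix_memalign also requires align to be a multiple of sizeof(void *). */
    if (posix_memalign(&buf, align, n) != 0)
        return NULL;
    memset(buf, 0, n);
    *padded_len = n;
    return buf;
}
```

A caller would pass the configured value, for example
alloc_aligned(nbytes, 4096, &padded), and write padded bytes so that
both the address and the length satisfy the O_DIRECT requirement.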
>
> -Neil
--
Mary Edie Meredith
maryedie(at)osdl(dot)org
503-906-1942
Data Center Linux Initiative Manager
Open Source Development Labs
