Re: [HACKERS] O_DIRECT for WAL writes

From: Mark Wong <markw(at)osdl(dot)org>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org, Daniel McNeil <daniel(at)osdl(dot)org>, Mark Haverkamp <markh(at)osdl(dot)org>
Subject: Re: [HACKERS] O_DIRECT for WAL writes
Date: 2005-08-06 21:04:19
Message-ID: 20050806210419.GA31044@osdl.org
Lists: pgsql-hackers pgsql-patches

Here are comments that Daniel McNeil made a while ago, which I've
neglected to forward until now. I've cc'ed him and Mark Haverkamp,
whom some of you got to meet the other day.

Mark

-----

With O_DIRECT on Linux, when the write() returns, the i/o has been
transferred to the disk.

Normally, this i/o will be DMAed directly from user space to the
device. The current exception is when doing an O_DIRECT write to a
hole in a file. (If a program does a truncate() or lseek()/write()
that makes a file larger, the file system does not allocate space
between the old end of file and the new end of file.) An O_DIRECT
write to a hole like this requires the file system to allocate space,
but there is a race condition between the O_DIRECT write, which
allocates the space and then writes to initialize the newly allocated
data, and any other process that attempts a buffered (page cache) read
of the same area in the file -- it was possible for the read to return
data from the allocated region before the O_DIRECT write() had
initialized it. The fix in Linux is for the O_DIRECT write() to fall
back to buffered i/o to do the write() and then flush the data from
the page cache to the disk.
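
One way for an application to stay out of this fallback path is to make
sure the file has no holes before any O_DIRECT writer touches it. The
following is only a minimal sketch of that idea, not what any particular
database does: the function name preallocate_zeroed() and the 8192-byte
chunk size are made up for illustration. It fills the file with zeros
using ordinary buffered writes and then calls fsync().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define PREALLOC_CHUNK 8192     /* hypothetical chunk size for zero-filling */

/*
 * Fill `path` with `size` bytes of zeros so that every block is allocated
 * before an O_DIRECT writer uses the file. Uses plain buffered writes.
 * Returns 0 on success, -1 on error.
 */
int preallocate_zeroed(const char *path, off_t size)
{
    char zeros[PREALLOC_CHUNK];
    off_t done = 0;
    int fd;

    if (size < 0)
        return -1;

    memset(zeros, 0, sizeof(zeros));
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    while (done < size) {
        size_t want = sizeof(zeros);
        ssize_t n;

        /* Do not write past the requested size on the last chunk. */
        if ((off_t) want > size - done)
            want = (size_t) (size - done);

        n = write(fd, zeros, want);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing; retry */
            close(fd);
            return -1;
        }
        done += n;              /* short writes are handled by looping */
    }
    assert(done == size);

    /* Flush the zeros and the new file size to disk. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A caller would run something like preallocate_zeroed("logfile", 16 * 1024 * 1024)
once when the file is created, and only afterwards reopen it with O_DIRECT.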

A write() with O_DIRECT only means the data has been transferred to
the disk. Depending on the file system and mount options, it does
not mean the metadata for the file has been written to disk (see the
fsync man page). fsync() will guarantee the data and metadata have
been written to disk.

Lastly, if a disk has a write back cache, an O_DIRECT write() does not
guarantee that the disk has put the data on the physical media.
I think some of the journaling file systems now support i/o barriers
on commit, which will flush the disk write back cache. (I'm still
looking at the kernel code to see how this is done.)

Conclusion:

O_DIRECT + fsync() can make sense. It avoids copying the data
to the page cache before it is written, and the fsync() will also
guarantee that the file's metadata is written to disk. It also
prevents the page cache from filling up with write data that
will never be read (I assume it is only read if a recovery
is necessary, which should be rare). It can also
help with disks that have a write back cache when using a journaling
file system that uses i/o barriers. You would want to use
large writes, since the kernel page cache won't be writing
multiple pages for you.
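
As a rough illustration of the O_DIRECT + fsync() combination, here is
a sketch of one large aligned write followed by fsync(). The function
name direct_write_and_sync() and the 4096-byte alignment are assumptions
made for the example; the real alignment requirement depends on the
device and file system. The fallback to buffered i/o when O_DIRECT is
rejected with EINVAL is also just a choice made for this sketch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* assumption: no O_DIRECT, use buffered i/o */
#endif

#define DIRECT_ALIGN 4096       /* assumed buffer/offset/length alignment */

/*
 * Write `len` bytes from `data` at offset 0 of `path` with O_DIRECT,
 * zero-padded up to a multiple of DIRECT_ALIGN, then fsync().
 * Returns 0 on success, -1 on error.
 */
int direct_write_and_sync(const char *path, const void *data, size_t len)
{
    size_t padded = (len + DIRECT_ALIGN - 1) / DIRECT_ALIGN * DIRECT_ALIGN;
    size_t done = 0;
    void *buf = NULL;
    int direct = (O_DIRECT != 0);
    int fd, rc = -1;

    if (len == 0)
        return -1;
    assert(padded % DIRECT_ALIGN == 0);

    /* O_DIRECT needs an aligned user buffer; malloc() does not promise one. */
    if (posix_memalign(&buf, DIRECT_ALIGN, padded) != 0)
        return -1;
    memset(buf, 0, padded);
    memcpy(buf, data, len);

    fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0 && errno == EINVAL) {
        /* Some file systems reject O_DIRECT at open time. */
        direct = 0;
        fd = open(path, O_WRONLY | O_CREAT, 0600);
    }
    if (fd < 0)
        goto out;

    while (done < padded) {
        ssize_t n = pwrite(fd, (char *) buf + done, padded - done, (off_t) done);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            if (errno == EINVAL && direct) {
                /* Direct i/o refused at write time: drop the flag and retry. */
                int fl = fcntl(fd, F_GETFL);

                if (fl < 0 || fcntl(fd, F_SETFL, fl & ~O_DIRECT) < 0)
                    goto out_close;
                direct = 0;
                continue;
            }
            goto out_close;
        }
        done += (size_t) n;
    }

    /* The data has been handed to the disk; fsync() covers the metadata. */
    if (fsync(fd) == 0)
        rc = 0;

out_close:
    if (close(fd) != 0)
        rc = -1;
out:
    free(buf);
    return rc;
}
```

Passing one large buffer per call, rather than many small ones, is the
point of the "large writes" remark: each O_DIRECT write() goes to the
device on its own. Note that even after fsync() returns, a disk with a
write back cache may still not have the data on the physical media.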

I need to look at the kernel code more to comment on O_DIRECT with
O_SYNC.

Questions:

Does the database transaction logger preallocate the log file?

Does the logger care about the order in which each write hits the disk?

Now someone else can comment on my comments.

Daniel
