Re: Use of O_DIRECT only for open_* sync options

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Use of O_DIRECT only for open_* sync options
Date: 2011-01-23 13:43:11
Message-ID: 4D3C306F.8030209@2ndquadrant.com
Lists: pgsql-hackers

Bruce Momjian wrote:
> xlogdefs.h says:
>
> /*
> * Because O_DIRECT bypasses the kernel buffers, and because we never
> * read those buffers except during crash recovery, it is a win to use
> * it in all cases where we sync on each write(). We could allow O_DIRECT
> * with fsync(), but because skipping the kernel buffer forces writes out
> * quickly, it seems best just to use it for O_SYNC. It is hard to imagine
> * how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
> * Also, O_DIRECT is never enough to force data to the drives, it merely
> * tries to bypass the kernel cache, so we still need O_SYNC or fsync().
> */
>
> This seems wrong because fsync() can win if there are two writes before
> the sync call. Can kernels not issue fsync() if the write was O_DIRECT?
> If that is the cause, we should document it.
>

The comment does look busted, because you did imagine exactly a case
where they might be combined. The only incompatibility that I'm aware
of is that O_DIRECT requires reads and writes to be aligned properly
(buffer address, file offset, and transfer size), so you can't use it in
random application code unless that code is aware of the requirement.
O_DIRECT and fsync are compatible; for example, MySQL allows combining
the two: http://dev.mysql.com/doc/refman/5.1/en/innodb-parameters.html

(That whole bit of documentation around innodb_flush_method includes
some very interesting observations around O_DIRECT actually)
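
To make the alignment point and the "two writes, one sync" case concrete,
here is a minimal C sketch. It is not PostgreSQL code: the helper name
write_two_blocks_then_fsync, the SKETCH_* macros, and the 4096-byte
alignment are all assumptions made for illustration. Where O_DIRECT is
unavailable or rejected by the filesystem, the sketch falls back to a
plain buffered open.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc only exposes O_DIRECT with this set */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Assumed alignment; real code should query the device or filesystem. */
#define SKETCH_BLOCK 4096

/* Platforms without O_DIRECT get a no-op flag instead. */
#ifdef O_DIRECT
#define SKETCH_O_DIRECT O_DIRECT
#else
#define SKETCH_O_DIRECT 0
#endif

/*
 * Hypothetical helper: issue two aligned writes, then a single fsync().
 * Returns 0 on success and -1 on failure. *used_direct is set to 1 when
 * the file was actually opened with O_DIRECT, 0 otherwise.
 */
static int
write_two_blocks_then_fsync(const char *path, int *used_direct)
{
    void *buf = NULL;
    int   flags = O_WRONLY | O_CREAT | O_TRUNC;
    int   rc = -1;
    int   fd;

    *used_direct = (SKETCH_O_DIRECT != 0);
    fd = open(path, flags | SKETCH_O_DIRECT, 0600);
    if (fd < 0 && errno == EINVAL)
    {
        /* Some filesystems reject O_DIRECT at open time; retry buffered. */
        *used_direct = 0;
        fd = open(path, flags, 0600);
    }
    if (fd < 0)
        return -1;

    /* O_DIRECT needs an aligned buffer, aligned offsets, and aligned sizes. */
    if (posix_memalign(&buf, SKETCH_BLOCK, SKETCH_BLOCK) != 0)
    {
        close(fd);
        return -1;
    }
    memset(buf, 'x', SKETCH_BLOCK);

    /* Two writes, one sync: the pattern an O_SYNC open cannot batch. */
    if (write(fd, buf, SKETCH_BLOCK) == SKETCH_BLOCK &&
        write(fd, buf, SKETCH_BLOCK) == SKETCH_BLOCK &&
        fsync(fd) == 0)
        rc = 0;

    free(buf);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

A caller would pass a scratch path and an int to receive the mode, e.g.
write_two_blocks_then_fsync("/tmp/scratch.dat", &used). The fsync() is
still required with O_DIRECT, matching the last sentence of the quoted
xlogdefs.h comment.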

I'm starting to consider the idea that much of the performance gain
seen on earlier systems with O_DIRECT came from reduced CPU usage
shuffling things into the OS cache, rather than from avoiding
pollution of said cache. On Linux, for example, its main accomplishment
is described like this: "File I/O is done directly to/from user space
buffers."
http://www.kernel.org/doc/man-pages/online/pages/man2/open.2.html
The earliest paper on the implementation suggests a big decrease in CPU
overhead from that:
http://www.ukuug.org/events/linux2001/papers/html/AArcangeli-o_direct.html

Impossible to guess whether that's more true ("CPU cache pollution is a
bigger problem now") or less true ("drives are much slower relative to
CPUs now") today. I'm trying to remain agnostic and let the benchmarks
offer an opinion instead.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
