Re: possible new option for wal_sync_method

From: Dan Scales <scales(at)vmware(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: possible new option for wal_sync_method
Date: 2012-02-17 00:17:27
Message-ID: 895135865.2038752.1329437847788.JavaMail.root@zimbra-prod-mbox-4.vmware.com
Lists: pgsql-hackers

Good point, thanks. From the ext3 source code, it looks like
ext3_sync_file() calls blkdev_issue_flush(), which issues a flush to the
block device, whereas a plain direct I/O write does not. So that would
make this wal_sync_method option less useful, since, as you say, the user
would have to know whether the block device is doing write caching.
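
Just to spell out what I mean by the direct I/O path (this is only a
sketch; the file name and sizes are made up, not taken from the patch),
an open_direct-style WAL write boils down to roughly the following, and
nothing in it asks the device to flush its volatile write cache:

/* Sketch of an open_direct-style WAL write (illustrative only).
 * pwrite() returns once the aligned block has been handed to the
 * device, but nothing here issues a blkdev_issue_flush()-style cache
 * flush, so the data may still sit in a volatile write cache. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* no O_SYNC/O_DSYNC, just O_DIRECT */
    int fd = open("wal_segment", O_WRONLY | O_DIRECT);
    if (fd < 0)
        return 1;

    void *buf;
    if (posix_memalign(&buf, 8192, 8192) != 0)   /* 8K-aligned buffer */
        return 1;
    memset(buf, 0, 8192);

    /* aligned write into pre-allocated space: no metadata update, no
     * journal commit, and no device cache flush */
    if (pwrite(fd, buf, 8192, 0) != 8192)
        return 1;

    free(buf);
    close(fd);
    return 0;
}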

For the numbers I reported, I don't think the performance gain comes from
skipping the block device flush. The system being measured is a Fibre
Channel disk on an array that should be fully non-volatile, and
measurements using systemtap show that blkdev_issue_flush() consistently
completes in the microsecond range.

I think the overhead is still from the fact that ext3_sync_file() waits
for the current in-flight journal transaction if there is one (and does
an explicit device flush if there is no transaction to wait for). I do
think there are lots of metadata operations happening on the data files
(especially for a growing database), so the WAL commit ends up waiting
for unrelated data operations. It would be nice if there were a simple
filesystem operation that just flushed the cache of the block device
containing the filesystem (i.e. just did the blkdev_issue_flush() and
not the other things in ext3_sync_file()).
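
The closest user-space approximation I can think of is to open the block
device node itself and fdatasync() that (just a sketch below; the device
path is made up, it needs permission on the device, and it assumes that
fdatasync() on a block device node reaches blkdev_issue_flush(), which I
believe recent kernels do):

/* Rough sketch: flush just the device write cache, without any
 * filesystem/journal involvement, by fdatasync()'ing the block device
 * node that holds the filesystem.  Device path is made up. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sdb", O_RDONLY);   /* device holding the WAL */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* on recent kernels this should end up in blkdev_issue_flush() */
    if (fdatasync(fd) != 0) {
        perror("fdatasync");
        return 1;
    }
    close(fd);
    return 0;
}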

The ext4_sync_file() code looks fairly similar, so I think it may have
the same problem, though I can't be positive. If so, this
wal_sync_method option might help ext4 as well.

With respect to sync_file_range(), the Linux code that I'm looking at
doesn't really seem to indicate that there is a device flush (since it
never calls an f_op->fsync_file operation). So sync_file_range() may not
be as useful as thought.
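
For reference, the combination Andres suggests below would look roughly
like this (sketch only; the file name and range are made up), and from
my reading none of these flags causes a device cache flush:

/* Sketch of the sync_file_range() call discussed below.  It initiates
 * and waits for writeback of the given range, but (as far as I can
 * tell) never flushes the device cache, so it is weaker than
 * fdatasync() when a volatile write cache is present. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("wal_segment", O_WRONLY);
    if (fd < 0)
        return 1;

    /* ... WAL writes happen here ... */

    if (sync_file_range(fd, 0, 8192,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER) != 0)
        return 1;

    close(fd);
    return 0;
}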

By the way, all the numbers were measured with the "data=writeback,
barrier=1" options for ext3. I don't think I have seen a significant
difference in the DBT2 workload with the ext3 option data=ordered.

I will measure all these numbers again tonight, but with barrier=0, to
try to confirm that the device flush itself isn't costing much in this
configuration.

Dan

----- Original Message -----
From: "Andres Freund" <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: "Dan Scales" <scales(at)vmware(dot)com>
Sent: Thursday, February 16, 2012 10:32:09 AM
Subject: Re: [HACKERS] possible new option for wal_sync_method

Hi,

On Thursday, February 16, 2012 06:18:23 PM Dan Scales wrote:
> When running Postgres on a single ext3 filesystem on Linux, we find that
> the attached simple patch gives significant performance benefit (7-8% in
> numbers below). The patch adds a new option for wal_sync_method, which
> is "open_direct". With this option, the WAL is always opened with
> O_DIRECT (but not O_SYNC or O_DSYNC). For Linux, the use of only
> O_DIRECT should be correct. All WAL logs are fully allocated before
> being used, and the WAL buffers are 8K-aligned, so all direct writes are
> guaranteed to complete before returning. (See
> http://lwn.net/Articles/348739/)
I don't think that behaviour is safe in the face of write caches in the IO
path. Linux takes care to issue flush/barrier instructions when necessary if
you issue an fsync/fdatasync, but to my knowledge it does not when O_DIRECT is
used (that would suck performancewise).
I think that behaviour is safe if you have no externally visible write caching
enabled, but that's not exactly easy knowledge to obtain or document.

Why should there otherwise be any performance difference between O_DIRECT|
O_SYNC and O_DIRECT in the WAL write case? There is no metadata that needs to
be written, and I have a hard time imagining that the check for whether there
is metadata is that expensive.

I guess a more interesting case would be comparing O_DIRECT|O_SYNC with
O_DIRECT + fdatasync() or even O_DIRECT +
sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
SYNC_FILE_RANGE_WAIT_AFTER).

Any special reason you did that comparison on ext3? Especially with
data=ordered, its behaviour regarding syncs is pretty insane performancewise.
Ext4 would be a bit more interesting...

Andres
