Re: possible new option for wal_sync_method

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Dan Scales <scales(at)vmware(dot)com>
Subject: Re: possible new option for wal_sync_method
Date: 2012-02-27 20:43:49
Message-ID: 201202272143.50110.andres@anarazel.de
Lists: pgsql-hackers

Hi,

On Friday, February 17, 2012 01:17:27 AM Dan Scales wrote:
> Good point, thanks.  From the ext3 source code, it looks like
> ext3_sync_file() does a blkdev_issue_flush(), which issues a flush to the
> block device, whereas simple direct IO does not.  So, that would make
> this wal_sync_method option less useful, since, as you say, the user
> would have to know if the block device is doing write caching.
The experiments I know of that played with disabling write caches nearly always 
found that write caching was worth the overhead of syncing.

> For the numbers I reported, I don't think the performance gain is from
> not doing the block device flush.  The system being measured is a Fibre
> Channel disk which should have a fully-nonvolatile disk array.  And
> measurements using systemtap show that blkdev_issue_flush() always takes
> only in the microsecond range.
Well, I think it has some I/O queue implications which could explain some of 
the difference. In that regard I think it heavily depends on the kernel 
version, as that's an area which has had loads of pretty radical changes in 
nearly every release since 2.6.32.

> I think the overhead is still from the fact that ext3_sync_file() waits
> for the current in-flight transaction if there is one (and does an
> explicit device flush if there is no transaction to wait for.)  I do
> think there are lots of meta-data operations happening on the data files
> (especially for a growing database), so the WAL log commit is waiting for
> unrelated data operations.  It would be nice if there were a simple file
> system operation that just flushed the cache of the block device
> containing the filesystem (i.e. just does the blkdev_issue_flush() and
> not the other things in ext3_sync_file()).
I think you are right there. I think the metadata issue could be relieved a 
lot by growing files in much larger chunks than we currently do. I 
have seen profiles which indicated that lots of time was spent on increasing 
the file size. I would be very interested in seeing how much changes in that 
area would benefit real-world benchmarks.
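
As a rough illustration of what "growing in larger chunks" could look like at 
the system call level, here is a minimal sketch. It is not PostgreSQL's 
extension code: the helper grow_in_chunks() and the GROW_CHUNK_BYTES value are 
hypothetical names chosen for the example, and posix_fallocate() is just one 
way to reserve space ahead of need.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical chunk size for this sketch; not a PostgreSQL setting. */
#define GROW_CHUNK_BYTES ((off_t) 16 * 1024 * 1024)

/*
 * Make sure fd has space allocated for at least `needed` bytes.  The new
 * size is rounded up to a multiple of GROW_CHUNK_BYTES, so many small
 * extensions collapse into one metadata-changing operation.
 * Returns 0 on success or an errno value on failure.
 */
static int
grow_in_chunks(int fd, off_t needed)
{
    struct stat st;
    off_t       target;

    assert(needed >= 0);

    if (fstat(fd, &st) != 0)
        return errno;
    if (st.st_size >= needed)
        return 0;               /* already large enough, nothing to do */

    /* Round the requested size up to the next chunk boundary. */
    target = ((needed + GROW_CHUNK_BYTES - 1) / GROW_CHUNK_BYTES)
        * GROW_CHUNK_BYTES;

    /* posix_fallocate() returns the error number directly, not via errno. */
    return posix_fallocate(fd, st.st_size, target - st.st_size);
}
```

With this shape, a caller that appends one block at a time would only pay for 
a size change once per chunk. Whether that actually helps the benchmarks is 
exactly the open question.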

> The ext4_sync_file() code looks fairly similar, so I think it may have
> the same problem, though I can't be positive.  In that case, this
> wal_sync_method option might help ext4 as well.
The journaling code for ext4 is significantly different, so I think it very 
well might play a role here, although you're probably right and it won't be in 
*_sync_file.

> With respect to sync_file_range(), the Linux code that I'm looking at
> doesn't really seem to indicate that there is a device flush (since it
> never calls a f_op->fsync_file operation).  So sync_file_range() may
> not be as useful as thought.
Hm, need to check that. I thought it invoked that path somewhere.
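
For reference, a conservative way to use sync_file_range() without relying on 
it for durability is to follow it with fdatasync(). The sketch below assumes 
Linux with glibc; flush_range_durably() is a hypothetical helper name, not 
something from the PostgreSQL tree. Note that the sync_file_range(2) man page 
itself warns that the call does not write out file metadata, so it is not a 
crash-safety guarantee on its own.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Start and wait for writeback of [offset, offset + nbytes) on fd, then
 * call fdatasync() so that durability does not depend on whatever
 * sync_file_range() does or does not do at the device level.
 * Returns 0 on success or an errno value on failure.
 */
static int
flush_range_durably(int fd, off_t offset, off_t nbytes)
{
    assert(offset >= 0 && nbytes >= 0);

    /* Push the dirty pages in the range to the block layer and wait. */
    if (sync_file_range(fd, offset, nbytes,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER) != 0)
        return errno;

    /* Data-integrity sync of the whole file, including needed metadata. */
    if (fdatasync(fd) != 0)
        return errno;

    return 0;
}
```

If the device flush really is missing from sync_file_range(), the fdatasync() 
here is what provides the guarantee, and the first call only serves to get 
writeback going early.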

> By the way, all the numbers were measured with "data=writeback,
> barrier=1" options for ext3.  I don't think that I have seen a
> significant difference in the DBT2 workload with the ext3 option
> data=ordered.
You have not? Interesting again, because I have seen results that differed by 
an order of magnitude.

> I will measure all these numbers again tonight, but with barrier=0, so as
> to try to confirm that the write flush itself isn't costing a lot for
> this configuration.
Got any results so far?

Thanks,

Andres
