Re: possible new option for wal_sync_method

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Dan Scales <scales(at)vmware(dot)com>
Subject: Re: possible new option for wal_sync_method
Date: 2012-02-27 20:43:49
Message-ID: 201202272143.50110.andres@anarazel.de
Lists: pgsql-hackers

Hi,

On Friday, February 17, 2012 01:17:27 AM Dan Scales wrote:
> Good point, thanks. From the ext3 source code, it looks like
> ext3_sync_file() does a blkdev_issue_flush(), which issues a flush to the
> block device, whereas simple direct IO does not. So, that would make
> this wal_sync_method option less useful, since, as you say, the user
> would have to know if the block device is doing write caching.
The experiments I know of that played with disabling write caches nearly always
found that write caching was worth the overhead of syncing.

> For the numbers I reported, I don't think the performance gain is from
> not doing the block device flush. The system being measured is a Fibre
> Channel disk which should have a fully-nonvolatile disk array. And
> measurements using systemtap show that blkdev_issue_flush() always takes
> only in the microsecond range.
Well, I think it has some I/O queue implications which could explain some of
the difference. In that regard I think it depends heavily on the kernel
version, as that's an area which has had loads of pretty radical changes in
nearly every release since 2.6.32.

> I think the overhead is still from the fact that ext3_sync_file() waits
> for the current in-flight transaction if there is one (and does an
> explicit device flush if there is no transaction to wait for.) I do
> think there are lots of meta-data operations happening on the data files
> (especially for a growing database), so the WAL log commit is waiting for
> unrelated data operations. It would be nice if there were a simple file
> system operation that just flushed the cache of the block device
> containing the filesystem (i.e. just does the blkdev_issue_flush() and
> not the other things in ext3_sync_file()).
I think you are right there. I think the metadata issue could be relieved a
lot by growing files in much larger chunks than we do currently. I
have seen profiles which indicated that lots of time was spent on increasing
the file size. I would be very interested in seeing how much changes in that
area would benefit real-world benchmarks.
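
A minimal sketch of what growing a file in larger chunks could look like,
assuming posix_fallocate() is available. The helper name and the 1 MB chunk
size are hypothetical and not taken from this thread or from the PostgreSQL
source; the point is only that one large extension replaces many small ones,
so fewer metadata updates end up in the filesystem journal.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical chunk size; the thread does not propose a number. */
#define EXTEND_CHUNK_BYTES (1024 * 1024)

/*
 * Hypothetical helper: make sure the file behind fd is at least `needed`
 * bytes long, rounding the new size up to a multiple of EXTEND_CHUNK_BYTES.
 * Returns 0 on success or an error number on failure.
 */
static int
extend_in_chunks(int fd, off_t needed)
{
    struct stat st;
    off_t       target;

    if (fstat(fd, &st) != 0)
        return errno;

    /* Already large enough: nothing to allocate. */
    if (st.st_size >= needed)
        return 0;

    /* Round the requested size up to the next chunk boundary. */
    target = ((needed + EXTEND_CHUNK_BYTES - 1) / EXTEND_CHUNK_BYTES)
        * EXTEND_CHUNK_BYTES;

    /* posix_fallocate() returns the error number directly, not via errno. */
    return posix_fallocate(fd, st.st_size, target - st.st_size);
}
```

Whether this actually helps would have to be measured; the allocation itself
still costs metadata work, it just happens less often.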

> The ext4_sync_file() code looks fairly similar, so I think it may have
> the same problem, though I can't be positive. In that case, this
> wal_sync_method option might help ext4 as well.
The journaling code for ext4 is significantly different, so I think it very
well might play a role here, although you're probably right and it won't be in
*_sync_file.

> With respect to sync_file_range(), the Linux code that I'm looking at
> doesn't really seem to indicate that there is a device flush (since it
> never calls a f_op->fsync_file operation). So sync_file_range() may
> not be as useful as thought.
Hm, need to check that. I thought it invoked that path somewhere.
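
For reference, a sketch of how a caller might combine the two calls if
sync_file_range() indeed does not flush the device cache. The Linux man page
for sync_file_range() warns that it does not write file metadata and gives no
guarantee that data survives a crash, so this hypothetical helper treats it
as a writeback hint only and relies on fdatasync() for the result. Whether
fdatasync() itself reaches stable storage still depends on the filesystem and
its barrier settings.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write back [offset, offset + nbytes) of fd and then
 * ask for the file's data to be made durable.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
flush_range_durably(int fd, off_t offset, off_t nbytes)
{
    /*
     * Start writeback for the range and wait for it to finish. The return
     * value is deliberately ignored: this step is treated as advisory.
     */
    (void) sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);

    /* Only the fdatasync() result decides success or failure. */
    return fdatasync(fd);
}
```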

> By the way, all the numbers were measured with "data=writeback,
> barrier=1" options for ext3. I don't think that I have seen a
> significant difference when running the DBT2 workload with the ext3 option
> data=ordered.
You have not? Interesting again, because I have seen results that differed by
an order of magnitude.

> I will measure all these numbers again tonight, but with barrier=0, so as
> to try to confirm that the write flush itself isn't costing a lot for
> this configuration.
Got any results so far?

Thanks,

Andres
