On Friday, February 17, 2012 01:17:27 AM Dan Scales wrote:
> Good point, thanks. From the ext3 source code, it looks like
> ext3_sync_file() does a blkdev_issue_flush(), which issues a flush to the
> block device, whereas simple direct IO does not. So, that would make
> this wal_sync_method option less useful, since, as you say, the user
> would have to know if the block device is doing write caching.
The experiments I know of that played with disabling write caches nearly always
found that write caching was worth the overhead of syncing.
> For the numbers I reported, I don't think the performance gain is from
> not doing the block device flush. The system being measured is a Fibre
> Channel disk which should have a fully-nonvolatile disk array. And
> measurements using systemtap show that blkdev_issue_flush() always takes
> only in the microsecond range.
Well, I think it has some I/O queue implications which could explain some of
the difference. In that regard I think it heavily depends on the kernel
version, as that's an area which has seen loads of pretty radical changes in
nearly every release since 2.6.32.
> I think the overhead is still from the fact that ext3_sync_file() waits
> for the current in-flight transaction if there is one (and does an
> explicit device flush if there is no transaction to wait for.) I do
> think there are lots of meta-data operations happening on the data files
> (especially for a growing database), so the WAL log commit is waiting for
> unrelated data operations. It would be nice if there were a simple file
> system operation that just flushed the cache of the block device
> containing the filesystem (i.e. just does the blkdev_issue_flush() and
> not the other things in ext3_sync_file()).
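(For what it's worth, such a flush-only call can roughly be approximated from userspace already, assuming - and this is worth verifying against the specific kernel version - that fsync() on the block-device node itself ends up in blkdev_issue_flush() without any journal involvement. A minimal sketch; the device path is a hypothetical placeholder:)

```c
/* Approximating a "flush only the device cache" operation from userspace.
 * Assumption (worth verifying per kernel version): fsync() on an open
 * block-device node goes through blkdev_fsync(), which ends in
 * blkdev_issue_flush() -- i.e. just the cache flush, none of the journal
 * waiting done by ext3_sync_file(). */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Flush the write cache of the device at devpath. Returns 0 on success. */
int flush_device_cache(const char *devpath)
{
    int fd = open(devpath, O_RDONLY);   /* read access suffices for fsync */
    if (fd < 0) {
        perror("open");
        return -1;
    }
    int rc = fsync(fd);  /* on a block-device node: just the device flush */
    if (rc != 0)
        perror("fsync");
    close(fd);
    return rc;
}
```

You would run this as root against the device backing the filesystem, e.g. flush_device_cache("/dev/sda") - device name hypothetical, of course.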
I think you are right there. I think the metadata issue could be relieved a
lot by growing files in much larger increments than we currently do. I have
seen profiles which indicated that lots of time was spent on increasing the
file size. I would be very interested in seeing how much changes in that
area would benefit real-world benchmarks.
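A sketch of what growing files in larger increments could look like - the chunk size and function name are purely illustrative, not from any existing patch:

```c
/* Growing a data file in large preallocated chunks, so that subsequent
 * writes hit already-allocated blocks and an fsync has no unrelated
 * metadata to push through the journal. The 16 MB chunk size is an
 * illustrative guess, not a tuned value. */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK_BYTES ((off_t)16 * 1024 * 1024)

/* Ensure fd's file is at least `needed` bytes long, extending it in
 * CHUNK_BYTES steps. Returns 0 on success, an errno value otherwise. */
int ensure_capacity(int fd, off_t needed)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return errno;
    if (st.st_size >= needed)
        return 0;
    /* round the new size up to a chunk boundary */
    off_t target = ((needed + CHUNK_BYTES - 1) / CHUNK_BYTES) * CHUNK_BYTES;
    /* posix_fallocate allocates real blocks (unlike ftruncate, which
     * leaves a hole), so later writes trigger no block allocation */
    return posix_fallocate(fd, 0, target);
}
```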
> The ext4_sync_file() code looks fairly similar, so I think it may have
> the same problem, though I can't be positive. In that case, this
> wal_sync_method option might help ext4 as well.
The journaling code for ext4 is significantly different, so I think it very
well might play a role here - although you're probably right and it won't be in
> With respect to sync_file_range(), the Linux code that I'm looking at
> doesn't really seem to indicate that there is a device flush (since it
> never calls a f_op->fsync_file operation). So sync_file_range() may not
> be as useful as thought.
Hm, need to check that. I thought it invoked that path somewhere.
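For reference, a minimal sketch of the call being discussed. My reading of the man page matches yours: it only controls writeback of dirty pages and explicitly does not flush the disk write cache, so on its own it provides no durability guarantee:

```c
/* What sync_file_range() actually gives you: it starts writeback of the
 * dirty pages in the range and waits for it, but (per its man page) it
 * does not issue a device cache flush, so it is not a durability
 * guarantee by itself. */
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Push dirty pages of [off, off+nbytes) to the device and wait for the
 * writeback to finish. Returns 0 on success. */
int writeback_range(int fd, off_t off, off_t nbytes)
{
    return sync_file_range(fd, off, nbytes,
                           SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
}
```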
> By the way, all the numbers were measured with "data=writeback,
> barrier=1" options for ext3. I don't think that I have seen a
> significant difference when running the DBT2 workload for ext3 option
You have not? Interesting again because I have seen results that differed by a
> I will measure all these numbers again tonight, but with barrier=0, so as
> to try to confirm that the write flush itself isn't costing a lot for
> this configuration.
Got any results so far?
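For reference, the two configurations being compared would look like this in /etc/fstab - device and mount point are hypothetical placeholders:

```
# barriers on (the measured configuration):
/dev/sdb1  /var/lib/pgsql  ext3  data=writeback,barrier=1  0 2
# barriers off (the proposed re-run):
/dev/sdb1  /var/lib/pgsql  ext3  data=writeback,barrier=0  0 2
```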