Re: Re: Faster CREATE DATABASE by delaying fsync

From: Mark Mielke <mark(at)mark(dot)mielke(dot)cc>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Florian Weimer <fw(at)deneb(dot)enyo(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Re: Faster CREATE DATABASE by delaying fsync
Date: 2010-02-15 00:08:10
Message-ID: 4B78906A.7020309@mark.mielke.cc
Lists: pgsql-hackers pgsql-performance

On 02/14/2010 03:49 PM, Andres Freund wrote:
> On Sunday 14 February 2010 21:41:02 Mark Mielke wrote:
>
>> The widely reported problems, though, did not tend to be a problem with
>> directory changes written too late - but directory changes being written
>> too early. That is, the directory change is written to disk, but the
>> file content is not. This is likely because of the "ordered journal"
>> mode widely used in ext3/ext4 where metadata changes are journalled, but
>> file pages are not journalled. Therefore, it is important for some
>> operations, that the file pages are pushed to disk using fsync(file),
>> before the metadata changes are journalled.
>>
> Well, but that's not a problem with pg as it fsyncs the file contents.
>

Exactly. Not a problem.

>> If you are concerned, enable dirsync.
>>
> If the filesystem already behaves that way, an fsync on it should be fairly
> cheap. If it doesn't behave that way, doing it is correct...
>

Well, I disagree, as the whole point of this thread is that fsync() is
*not* cheap. :-)
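
For anyone who wants to put a rough number on that, here is a minimal sketch that times creating small files with and without an fsync() on each one. The function name, the file count and the 8 kB payload are my own illustrative choices, not anything taken from PostgreSQL. Note that the temporary directory may live on tmpfs, where fsync() is nearly free, so the test should be pointed at the filesystem of interest to be meaningful.

```python
import os
import tempfile
import time


def time_writes(count, do_fsync, base_dir=None):
    """Return the seconds taken to create `count` small files.

    If do_fsync is true, each file is fsync()'d before it is closed.
    base_dir (hypothetical parameter) selects the filesystem to test.
    """
    with tempfile.TemporaryDirectory(dir=base_dir) as d:
        start = time.perf_counter()
        for i in range(count):
            path = os.path.join(d, f"f{i}")
            fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
            try:
                os.write(fd, b"x" * 8192)
                if do_fsync:
                    os.fsync(fd)  # force the file contents to stable storage
            finally:
                os.close(fd)
        return time.perf_counter() - start
```

Comparing time_writes(200, False) against time_writes(200, True) on the same mount gives a crude idea of the per-file fsync() cost there.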

> Besides there is no reason to fsync the directory before the checkpoint, so
> dirsync would require a higher cost than doing it correctly.
>

Using "ordered" metadata journalling has approximately the same effect.
Provided that the data is fsync()'d before the metadata is needed,
either the metadata is recorded in the journal, in which case the data
is accessible, or the metadata is NOT recorded in the journal, in which
case the files will appear to be missing. The races that theoretically
exist arise when the data of one file references a separate file that
does not yet exist.

You said you would try to reproduce - are you going to try to
reproduce on ext3/ext4 with ordered journalling enabled? I think
reproducing outside of a case such as CREATE DATABASE would be
difficult. It would have to be something like:

  1. open(O_CREAT)/write()/fsync()/close() of new data file, where data
     gets written, but directory data is not yet written out to journal
  2. open()/.../write()/fsync()/close() of existing file to point to new
     data file, but directory data is still not yet written out to journal
  3. crash

In this case, "dirsync" should be effective at closing this hole.
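
Without "dirsync", the per-call alternative is the explicit directory fsync() that Andres is arguing for. A minimal sketch of the sequence above with that extra step added, assuming a POSIX system where a directory can be opened read-only and fsync()'d (true on Linux, not on Windows). The helper names and the file names "index" and "newdata" are hypothetical, chosen only for illustration.

```python
import os
import tempfile


def write_and_fsync(path, data):
    """Create or overwrite a file and fsync its contents."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:  # os.write may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)
    finally:
        os.close(fd)


def fsync_dir(path):
    """fsync a directory so that entries created in it reach stable storage."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def create_durably(path, data):
    """Step 1 plus the directory fsync that closes the hole."""
    write_and_fsync(path, data)
    fsync_dir(os.path.dirname(os.path.abspath(path)))


def demo():
    with tempfile.TemporaryDirectory() as d:
        index = os.path.join(d, "index")       # the "existing file"
        new_file = os.path.join(d, "newdata")  # the "new data file"
        create_durably(index, b"")
        # Step 1: the new file and its directory entry are both synced.
        create_durably(new_file, b"payload\n")
        # Step 2: only now does the existing file point at the new one.
        write_and_fsync(index, b"newdata\n")
        with open(index, "rb") as f:
            target = f.read().strip().decode()
        return os.path.exists(os.path.join(d, target))
```

The point of the ordering is that a crash after step 2 can no longer leave "index" naming a file whose directory entry never reached the journal.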

As for cost? Well, most PostgreSQL data is stored within file content,
not directory metadata. I think "dirsync" might slow down some
operations like CREATE DATABASE or "rm -fr", but I would not expect it
to affect day-to-day performance of the database under real load. Many
operating systems enable the equivalent of "dirsync" by default. I
believe Solaris does this, for example, and other than slowing down "rm
-fr", I don't recall any real complaints about the cost of "dirsync".

After writing the above, I'm seriously considering adding "dirsync" to
my /db mounts that hold PostgreSQL and MySQL data.
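
To check whether a mount already has the option, here is a small sketch that parses /proc/mounts-style text, assuming Linux and assuming the option shows up there as "dirsync". The function names and the sample devices are made up for illustration.

```python
def mount_options(mount_point, mounts_text):
    """Return the set of options for mount_point, or None if not mounted.

    mounts_text is expected in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Keep the last match: a later mount shadows an earlier one.
            options = set(fields[3].split(","))
    return options


def has_dirsync(mount_point, mounts_text):
    """True if mount_point is listed with the dirsync option."""
    opts = mount_options(mount_point, mounts_text)
    return opts is not None and "dirsync" in opts


# Hypothetical sample; on a real system read the text from /proc/mounts.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /db ext3 rw,dirsync,data=ordered 0 0
"""
```

On a live system, something like has_dirsync("/db", open("/proc/mounts").read()) would do the check for the /db mount.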

Cheers,
mark

--
Mark Mielke <mark(at)mielke(dot)cc>
