Re: POSIX file updates

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: James Mansion <james(at)mansionfamily(dot)plus(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: POSIX file updates
Date: 2008-03-31 22:44:27
Lists: pgsql-performance
On Mon, 31 Mar 2008, James Mansion wrote:

> Is it correct that POSIX requires that the updates to a single
> file are serialised in the filesystem layer?

Quoting from Lewine's "POSIX Programmer's Guide":

"After a write() to a regular file has successfully returned, any 
successful read() from each byte position in the file that was modified by 
that write() will return the data that was written by the write()...a 
similar requirement applies to multiple write operations to the same file 

That's the "contract" that has to be honored.  How your filesystem 
actually implements this contract is none of a POSIX write() call's 
business, so long as it does.

Multiple writers to the same file can still end up serialized somewhere 
because of how this call is implemented, though, so you're correct that 
this is a possible practical impact.
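
To make the contract concrete, here is a minimal sketch in Python, whose os module wraps the POSIX calls directly. The function name read_after_write is made up for illustration. It writes through one descriptor and reads through a second descriptor on the same file, with no sync in between; the contract is about what a later read() sees, not about what is on disk.

```python
import os
import tempfile


def read_after_write(payload: bytes, offset: int = 0) -> bytes:
    """Write payload through one descriptor, read it back through another."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    writer = os.open(path, os.O_WRONLY)
    reader = os.open(path, os.O_RDONLY)
    try:
        # pwrite() may write fewer bytes than requested, so loop until done.
        written = 0
        while written < len(payload):
            written += os.pwrite(writer, payload[written:], offset + written)
        # Deliberately no fsync(): visibility to readers does not depend on it.
        return os.pread(reader, len(payload), offset)
    finally:
        os.close(writer)
        os.close(reader)
        os.unlink(path)
```

Calling read_after_write(b"hello") hands back the same bytes that were written, even though nothing forced them to disk.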

> So, if we have a number of dirty pages to write back to a single
> file in the database (whether a table or index) then we cannot
> pass these through the POSIX filesystem layer into the TCQ/NCQ
> system on the disk drive, so it can reorder them?

As long as the reordering mechanism also honors that any reads that come 
after a write to a block reflect that write, they can be reordered.  The 
filesystem and drives are already doing elevator sorting and similar 
mechanisms underneath you to optimize things.  Unless you use a sync 
operation or some sort of write barrier, you don't really know what has 

> I have seen suggestions that on Solaris this can be relaxed.

There's some good notes in this area at: and

It's clear that such relaxation has benefits with some of Oracle's 
mechanisms as described.  But amusingly, PostgreSQL doesn't even support 
Solaris's direct I/O method right now unless you override the filesystem 
mounting options, so you end up needing to split it out and hack at that 
level regardless.

> I *assume* that PostgreSQL's lack of threads or AIO and the
> single bgwriter means that PostgreSQL 8.x does not normally
> attempt to make any use of such a relaxation but could do so if the
> bgwriter fails to keep up and other backends initiate flushes.

PostgreSQL writes transactions to the WAL.  When they have reached disk, 
confirmed by a successful f[data]sync or a completed synchronous write, 
that transaction is now committed.  Eventually the impacted items in the 
buffer cache will be written as well.  At checkpoint time, things are 
reconciled such that all dirty buffers at that point have been written, 
and now f[data]sync is called on each touched file to make sure those 
changes have made it to disk.
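
The sequence above can be sketched as a toy model. This is not PostgreSQL code: the ToyStore class, its file names, and the record format are all hypothetical, and as a simplification it holds dirty data in memory until the checkpoint rather than writing it out earlier. It only shows where the syncs sit: one on the WAL per commit, and one per touched data file at checkpoint time.

```python
import os
import tempfile


class ToyStore:
    """Hypothetical miniature of the commit/checkpoint sequence."""

    def __init__(self, directory: str):
        self.directory = directory
        self.wal_fd = os.open(os.path.join(directory, "wal"),
                              os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.dirty = {}     # (filename, offset) -> bytes not yet written out
        self.sync_log = []  # names passed to fsync(), in order, for inspection

    def commit(self, filename: str, offset: int, data: bytes) -> None:
        record = f"{filename}:{offset}:".encode() + data + b"\n"
        os.write(self.wal_fd, record)  # small record; partial writes ignored here
        os.fsync(self.wal_fd)          # committed only once this returns
        self.sync_log.append("wal")
        self.dirty[(filename, offset)] = data  # data file is updated later

    def checkpoint(self) -> None:
        touched = {}
        for (filename, offset), data in self.dirty.items():
            fd = touched.get(filename)
            if fd is None:
                fd = os.open(os.path.join(self.directory, filename),
                             os.O_WRONLY | os.O_CREAT, 0o600)
                touched[filename] = fd
            os.pwrite(fd, data, offset)  # plain write, no sync yet
        for filename, fd in touched.items():
            os.fsync(fd)                 # one sync per touched file
            os.close(fd)
            self.sync_log.append(filename)
        self.dirty.clear()

    def close(self) -> None:
        os.close(self.wal_fd)
```

A ToyStore built on a directory from tempfile.mkdtemp() records only "wal" syncs while commits arrive; the data files appear and get synced when checkpoint() runs.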

Writes are assumed to be lost in some memory (kernel, filesystem or disk 
cache) until they've been confirmed to be written to disk via the sync 
mechanism.  When a backend flushes a buffer out, as soon as the OS caches 
that write the database backend moves on without being concerned about how 
it's eventually going to get to disk one day.  As long as the newly 
written version comes back again if it's read, the database doesn't worry 
about what's happening until it specifically asks for a sync that proves 
everything is done.  So if the backends or the background writer are 
spewing updates out, they don't care if the OS doesn't guarantee the order 
they hit disk until checkpoint time; it's only the synchronous WAL writes 
that do.
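
As a small illustration of why the order of those buffer writes doesn't matter, the sketch below writes pages at their own offsets in a shuffled order, syncs once, and reads the file back. The function name flush_in_any_order and the 8192-byte PAGE_SIZE are assumptions made for the example, and it says nothing about what the OS or drive actually does in between.

```python
import os
import random
import tempfile

PAGE_SIZE = 8192  # assumed page size, for illustration only


def flush_in_any_order(pages: dict, seed: int = 0) -> bytes:
    """Write pages (page number -> bytes) in shuffled order, then sync once."""
    fd, path = tempfile.mkstemp()  # opened read/write
    try:
        order = list(pages)
        random.Random(seed).shuffle(order)
        for page_no in order:
            # Each page lands at its own offset, whatever order it was issued in.
            os.pwrite(fd, pages[page_no], page_no * PAGE_SIZE)
        os.fsync(fd)  # the single point where durability is requested
        size = os.fstat(fd).st_size
        return os.pread(fd, size, 0)
    finally:
        os.close(fd)
        os.unlink(path)
```

Different seeds give different write orders but the same file contents after the sync, which is all the checkpoint relies on.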

Also note that it's usually the case that backends write a substantial 
percentage of the buffers out themselves.  You should assume that's the 
case unless you've done some work to prove the background writer is 
handling most writes (which is difficult to even know before 8.3, much 
less tune for).

That's how I understand everything to work at least.  I will add the 
disclaimer that I haven't looked at the archive recovery code much yet. 
Maybe there's some expectation it has for general database write ordering 
in order for the WAL replay mechanism to work correctly; I can't imagine 
how that could work though.

* Greg Smith gsmith(at)gregsmith(dot)com Baltimore, MD
