Re: Reworking the writing of WAL

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reworking the writing of WAL
Date: 2011-08-12 18:02:17
Message-ID: CA+TgmoYR6sXfyS6gJCE-+BLpcvVDBZaO_=dObL+B+XdQBDsk1w@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 12, 2011 at 11:34 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> 1. Earlier, I suggested that the sync rep code would allow us to
> redesign the way we write WAL, using ideas from group commit. My
> proposal is that when a backend needs to flush WAL to local disk
> it will be added to a SHMQUEUE exactly the same as when we flush WAL
> to sync standby. The WALWriter will be woken by latch and then perform
> the actual work. When complete WALWriter will wake the queue in order,
> so there is a natural group commit effect. The WAL queue will be
> protected by a new lock WALFlushRequestLock, which should be much less
> heavily contended than the way we do things now. Notably this approach
> will mean that all waiters get woken quickly, without having to wait
> for the queue of WALWriteLock requests to drain down, so commit will
> be marginally quicker. On almost idle systems this will give very
> nearly the same response time as having each backend write WAL
> directly. On busy systems this will give optimal efficiency by having
> WALWriter working in a very tight loop to perform the I/O instead of
> queuing itself to get the WALWriteLock with all the other backends. It
> will also allow piggybacking of commits even when WALInsertLock is not
> available.

I like the idea of putting all the backends that are waiting for xlog
flush on a SHM_QUEUE, and having a single process do the flush and
then wake them all up. That seems like a promising approach, and
should avoid quite a bit of context-switching and spinlocking that
would otherwise be necessary. However, I think it's possible that the
overhead in the single-client case might be pretty significant, and
I'm wondering whether we might be able to set things up so that
backends can flush their own WAL in the uncontended case.

What I'm imagining is something like this:

struct
{
    slock_t     mutex;
    XLogRecPtr  CurrentFlushLSN;
    XLogRecPtr  HighestFlushLSN;
    SHM_QUEUE   WaitersForCurrentFlush;
    SHM_QUEUE   WaitersForNextFlush;
};

To flush, you first acquire the mutex. If CurrentFlushLSN is not
InvalidXLogRecPtr, then there's a flush in progress, and you add
yourself to either WaitersForCurrentFlush or WaitersForNextFlush,
depending on whether your LSN is at or below CurrentFlushLSN, or above
it. If you queue on WaitersForNextFlush, you also advance
HighestFlushLSN to the LSN you need flushed, if that's higher than its
current value. You then release the spinlock and sleep on your
semaphore.
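To make the decision concrete, here's a minimal sketch of the waiter
path, reduced to a single-process model: the function name and the
waiter counters are hypothetical stand-ins for the SHM_QUEUEs, and the
caller is assumed to already hold the mutex.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

typedef struct
{
    XLogRecPtr  CurrentFlushLSN;           /* LSN covered by the flush in progress */
    XLogRecPtr  HighestFlushLSN;           /* highest LSN wanted by next-round waiters */
    int         nWaitersForCurrentFlush;   /* counters stand in for the SHM_QUEUEs */
    int         nWaitersForNextFlush;
} FlushState;

/*
 * Called with the mutex held.  Returns true if the caller queued itself
 * and should release the mutex and sleep on its semaphore; false if no
 * flush is in progress and the caller should become the flusher.
 */
bool
request_flush(FlushState *fs, XLogRecPtr myLSN)
{
    if (fs->CurrentFlushLSN != InvalidXLogRecPtr)
    {
        if (myLSN <= fs->CurrentFlushLSN)
            fs->nWaitersForCurrentFlush++;  /* in-progress flush covers us */
        else
        {
            fs->nWaitersForNextFlush++;     /* must wait for the next round */
            if (myLSN > fs->HighestFlushLSN)
                fs->HighestFlushLSN = myLSN;
        }
        return true;
    }
    return false;
}
```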

But if you get the mutex and find that CurrentFlushLSN is InvalidXLogRecPtr,
then you know that no flush is in progress. In that case, you set
CurrentFlushLSN to the maximum of the LSN you need flushed and
HighestFlushLSN and move all WaitersForNextFlush over to
WaitersForCurrentFlush. You then release the spinlock and perform the
flush. After doing so, you reacquire the spinlock, remove everyone
from WaitersForCurrentFlush, note whether there are any
WaitersForNextFlush, and release the spinlock. If there were any
WaitersForNextFlush, you set the WAL writer latch. You then wake up
anyone you removed from WaitersForCurrentFlush.
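The flusher side of that protocol might look like the sketch below, in
the same simplified single-process model (hypothetical names, counters
in place of SHM_QUEUEs, the actual I/O and the mutex release/reacquire
elided to comments):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

typedef struct
{
    XLogRecPtr  CurrentFlushLSN;
    XLogRecPtr  HighestFlushLSN;
    int         nWaitersForCurrentFlush;
    int         nWaitersForNextFlush;
} FlushState;

/*
 * Entered with the mutex held after finding CurrentFlushLSN invalid.
 * Returns the number of current-round waiters to wake once the flush
 * completes, and sets *wakeWalWriter if more waiters queued up while
 * the flush was in progress.
 */
int
become_flusher(FlushState *fs, XLogRecPtr myLSN, bool *wakeWalWriter)
{
    int         nToWake;

    /* claim the flush, absorbing everything queued for the next round */
    fs->CurrentFlushLSN = (myLSN > fs->HighestFlushLSN) ? myLSN : fs->HighestFlushLSN;
    fs->HighestFlushLSN = InvalidXLogRecPtr;
    fs->nWaitersForCurrentFlush += fs->nWaitersForNextFlush;
    fs->nWaitersForNextFlush = 0;

    /* ... release mutex; write and fsync WAL up to CurrentFlushLSN ... */

    /* reacquire mutex: empty the current queue, note any new arrivals */
    nToWake = fs->nWaitersForCurrentFlush;
    fs->nWaitersForCurrentFlush = 0;
    *wakeWalWriter = (fs->nWaitersForNextFlush > 0);
    fs->CurrentFlushLSN = InvalidXLogRecPtr;

    /* ... release mutex, set WAL writer latch if needed, wake nToWake ... */
    return nToWake;
}
```

In a real multi-process setting, *wakeWalWriter could come back true
when backends queued on WaitersForNextFlush during the I/O; here the
single-process model can only exercise the case where nobody arrived.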

Every time the WAL writer latch is set, the WAL writer wakes up and
performs any needed flush, unless there's already one in progress.

This allows processes to flush their own WAL when there's no
contention, but as more contention develops the work moves to the WAL
writer which will then run in a tight loop, as in your proposal.

> 5. And we would finally get rid of the group commit parameters.

That would be great, and I think the performance will be quite a bit
better, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
