
Re: Group commit, revised

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Group commit, revised
Date: 2012-01-20 22:30:06
Message-ID:
Lists: pgsql-hackers
I spent some time cleaning this up. Details below, but here are the 
main changes:

* Reverted the removal of wal_writer_delay
* Doesn't rely on WAL writer. Any process can act as the "leader" now.
* Removed heuristic on big flushes
* Uses PGSemaphoreLock/Unlock instead of latches

On 20.01.2012 17:30, Robert Haas wrote:
> On Thu, Jan 19, 2012 at 10:46 PM, Peter Geoghegan <peter(at)2ndquadrant(dot)com> wrote:
>>> +       if (delta > XLOG_SEG_SIZE * CheckPointSegments ||
>>> +                       !ProcGlobal->groupCommitAvailable)
>>> That seems like a gigantic value.  I would think that we'd want to
>>> forget about group commit any time we're behind by more than one
>>> segment (maybe less).
>> I'm sure that you're right - I myself described it as horrible in my
>> original mail. I think that whatever value we set this to ought to be
>> well reasoned. Right now, your magic number doesn't seem much better
>> than my bogus heuristic (which only ever served as a placeholder
>> implementation that hinted at a well-principled formula). What's your
>> basis for suggesting that that limit would always be close to optimal?
> It's probably not - I suspect even a single WAL segment is still too
> large.  I'm pretty sure that it would be safe to always flush up to
> the next block boundary, because we always write the whole block
> anyway even if it's only partially filled.  So there's no possible
> regression there.  Anything larger than "the end of the current 8kB
> block" is going to be a trade-off between latency and throughput, and
> it seems to me that there's probably only one way to figure out what's
> reasonable: test a bunch of different values and see which ones
> perform well in practice.  Maybe we could answer part of the question
> by writing a test program which does repeated sequential writes of
> increasingly-large chunks of data.  We might find out for example that
> 64kB is basically the same as 8kB because most of the overhead is in
> the system call anyway, but beyond that the curve kinks.  Or it may be
> that 16kB is already twice as slow as 8kB, or that you can go up to
> 1MB without a problem.  I don't see a way to know that without testing
> it on a couple different pieces of hardware and seeing what happens.

I ripped away that heuristic for a flush that's "too large". After 
pondering it for a while, I came to the conclusion that as implemented 
in the patch, it was pointless. The thing is, if the big flush doesn't 
go through the group commit machinery, it's going to grab the 
WALWriteLock straight away. Before any smaller commits can proceed, they 
will need to grab that lock anyway, so the effect is the same as if the 
big flush had just joined the queue.

Maybe we should have a heuristic to split a large flush into smaller 
chunks. The WAL segment boundary would be a quite natural split point, 
for example, because when crossing the file boundary you have to issue 
separate fsync()s for the files anyway. But I'm happy with leaving that 
out for now; it's no worse than it used to be without group commit.

> Ugh.  Our current system doesn't even require this code to be
> interruptible, I think: you can't receive cancel or die interrupts
> either while waiting for an LWLock or while holding it.

Right. Or within HOLD/RESUME_INTERRUPTS blocks.

The patch added some CHECK_FOR_INTERRUPTS() calls into various places in 
the commit/abort codepaths, but that's all dead code; they're all within 
HOLD/RESUME_INTERRUPTS blocks.

I replaced the usage of latches with the more traditional 
PGSemaphoreLock/Unlock. The semaphore model works just fine in this case, 
where we have an LWLock to guard the wait queue, and when a process is 
waiting we know it will be woken up, or something is messed up at a 
pretty low level. We don't need a timeout or to wake up on other signals 
while waiting. Furthermore, the WAL writer didn't have a latch before this 
patch; it's not many lines of code to initialize the latch and set up 
the signal handler for it, but it already has a semaphore that's ready 
to use.

I wonder if we should rename the file into "xlogflush.c" or something 
like that, to make it clear that this works with any XLOG flushes, not 
just commits? Group commit is the usual term for this feature, so we 
should definitely mention that in the comments, but it might be better 
to rename the files/functions to emphasize that this is about WAL 
flushing in general.

This probably needs some further cleanup to fix things I've broken, and 
I haven't done any performance testing, but it's progress. Do you have a 
shell script or something that you used for the performance tests that I 
could run? Or would you like to re-run the tests you did earlier with 
this patch?

   Heikki Linnakangas

Attachment: group_commit_2012_01_21.patch
Description: text/x-diff (42.1 KB)

