Re: Group commit, revised

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Group commit, revised
Date: 2012-01-16 08:11:53
Message-ID: 4F13DBC9.3070603@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 16.01.2012 00:42, Peter Geoghegan wrote:
> I've also attached the results of a pgbench-tools driven benchmark,
> which are quite striking (Just the most relevant image - e-mail me
> privately if you'd like a copy of the full report, as I don't want to
> send a large PDF file to the list as a courtesy to others). Apart from
> the obvious improvement in throughput, there is also a considerable
> improvement in average and worst latency at all client counts.

Impressive results. How about uploading the PDF to the community wiki?

> The main way that I've added value here is by refactoring and fixing
> bugs. There were some tricky race conditions that caused the
> regression tests to fail for that early draft patch, but there were a
> few other small bugs too. There is an unprecedented latch pattern
> introduced by this patch: Two processes (the WAL Writer and any given
> connection backend) have a mutual dependency. Only one may sleep, in
> which case the other is responsible for waking it. Much of the
> difficulty in debugging this patch, and much of the complexity that
> I've added, came from preventing both from simultaneously sleeping,
> even in the face of various known failure modes like postmaster death.
> If this happens, it does of course represent a denial-of-service, so
> that's something that reviewers are going to want to heavily
> scrutinise. I suppose that sync rep should be fine here, as it waits
> on walsenders, but I haven't actually comprehensively tested the
> patch, so there may be undiscovered unpleasant interactions with other
> xlog related features. I can report that it passes the regression
> tests successfully, and on an apparently consistently basis - I
> battled with intermittent failures for a time.

I think it might be simpler if it wasn't the background writer that's
responsible for "driving" the group commit queue, but the backends
themselves. When a flush request comes in, you join the queue, and if
someone else is already doing the flush, sleep until the driver wakes
you up. If no-one is doing the flush yet (ie. the queue is empty), start
doing it yourself. You'll need a state variable to keep track who's
driving the queue, but otherwise I think it would be simpler as there
would be no dependency on WAL writer.

I tend think of the group commit facility as a bus. Passengers can hop
on board at any time, and they take turns on who drives the bus. When
the first passengers hops in, there is no driver so he takes the driver
seat. When the next passenger hops in, he sees that someone is driving
the bus already, so he sits down, and places a big sign on his forehead
stating the bus stop where he wants to get off, and goes to sleep. When
the driver has reached his own bus stop, he wakes up all the passengers
who wanted to get off at the same stop or any of the earlier stops [1].
He also wakes up the passenger whose bus stop is the farthest from the
current stop, and gets off the bus. The woken-up passengers who have
already reached their stops can immediately get off the bus, and the one
who has not notices that no-one is driving the bus anymore, so he takes
the driver seat.

[1] in a real bus, a passenger would not be happy if he's woken up too
late and finds himself at the next stop instead of the one where he
wanted to go, but for group commit, that is fine.

In this arrangement, you could use the per-process semaphore for the
sleep/wakeups, instead of latches. I'm not sure if there's any
difference, but semaphores are more tried and tested, at least.

> The benefits (and, admittedly, the controversies) of this patch go
> beyond mere batching of commits: it also largely, though not entirely,
> obviates the need for user backends to directly write/flush WAL, and
> the WAL Writer may never sleep if it continually finds work to do -
> wal_writer_delay is obsoleted, as are commit_siblings and
> commit_delay.

wal_writer_delay is still needed for controlling how often asynchronous
commits are flushed to disk.

> Auxiliary processes cannot use group commit. The changes made prevent
> them from availing of commit_siblings/commit_delay parallelism,
> because it doesn't exist anymore.

Auxiliary processes have PGPROC entries too. Why can't they participate?

> Group commit is sometimes throttled, which seems appropriate - if a
> backend requests that the WAL Writer flush an LSN deemed too far from
> the known flushed point, that request is rejected and the backend goes
> through another path, where XLogWrite() is called.

Hmm, if the backend doing the big flush gets the WALWriteLock before a
bunch of group committers, the group committers will have to wait until
the big flush is finished, anyway. I presume the idea of the throttling
is to avoid the situation where a bunch of small commits need to wait
for a huge flush to finish.

Perhaps the big flusher should always join the queue, but use some
heuristic to first flush up to the previous commit request, to wake up
others quickly, and do another flush to flush its own request after that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2012-01-16 08:17:31 Re: Standalone synchronous master
Previous Message Ilya Kosmodemiansky 2012-01-16 08:06:21 Re: SKIP LOCKED DATA