Quick Links

Re: Group commit, revised

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Peter Geoghegan <peter(at)2ndquadrant(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Group commit, revised
Date:	2012-01-19 17:40:08
Message-ID:	CA+Tgmoa5LRCrCevAbqjAZxnqX7Qgv0sK9GpxsOwm2Yg3=srZ5Q@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Jan 18, 2012 at 5:38 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Wed, Jan 18, 2012 at 1:23 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Tue, Jan 17, 2012 at 12:37 PM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>> I found it very helpful to reduce wal_writer_delay in pgbench tests, when
>>> running with synchronous_commit=off. The reason is that hint bits don't get
>>> set until the commit record is flushed to disk, so making the flushes more
>>> frequent reduces the contention on the clog. However, Simon made async
>>> commits nudge WAL writer if the WAL page fills up, so I'm not sure how
>>> relevant that experience is anymore.
>
> Still completely relevant and orthogonal to this discussion. The patch
> retains multi-modal behaviour.

I don't know what you mean by this. I think removing wal_writer_delay
is premature, because I think it still may have some utility, and the
patch removes it. That's a separate change that should be factored
out of this patch and discussed separately.

>> There's still a small but measurable effect there in master. I think
>> we might be able to make it fully auto-tuning, but I don't think we're
>> fully there yet (not sure how much this patch changes that equation).
>>
>> I suggested a design similar to the one you just proposed to Simon
>> when he originally suggested this feature. It seems that if the WAL
>> writer is the only one doing WAL flushes, then there must be some IPC
>> overhead - and context switching - involved whenever WAL is flushed.
>> But clearly we're saving something somewhere else, on the basis of
>> Peter's results, so maybe it's not worth worrying about. It does seem
>> pretty odd to have all the regular backends going through the WAL
>> writer and the auxiliary processes using a different mechanism,
>> though. If we got rid of that, maybe WAL writer wouldn't even require
>> a lock, if there's only one process that can be doing it at a time.
>
> When we did sync rep it made sense to have the WALSender do the work
> and for others to just wait. It would be quite strange to require a
> different design for essentially the same thing for normal commits and
> WAL flushes to local disk. I should mention the original proposal for
> streaming replication had each backend send data to standby
> independently and that was recognised as a bad idea after some time.
> Same for sync rep also.

I don't think those cases are directly comparable. SR is talking to
another machine, and I can't imagine that there is a terribly
convenient or portable way for every backend that needs one to get a
hold of the file descriptor for the socket. Even if it could, the
data is sent as a stream, so if multiple backends sent to the same
file descriptor you'd have to have some locking to prevent messages
from getting interleaved. Or else you could have multiple
connections, or use UDP, but that gets rather complex. Anyway, none
of this is an issue for file I/O: anybody can open the file, and if
two backends write data at different offsets at the same time - or the
same data at the same offset at the same time - there's no problem.
So the fact that it wasn't a good idea for SR doesn't convince me that
it's also a bad idea here.

On the other hand, I'm not saying we *should* do it that way, either -
i.e. I am not trying to "require a different design" just because it's
fun to make people change things. Rather, I am trying to figure out
whether the design you've chosen is in fact the best one, and part of
that involves reasoning about why it might or might not be. There are
obvious reasons to think that having process A kick process B and go
to sleep, then have process B do some work and wake up process A might
be less efficient than having process A just do the work itself, in
the uncontended case. The fact that that isn't happening seems
awfully strange to me; it's hard to understand why 3 system calls are
faster than one. That suggests that either the current system is
badly broken in some way that this patch fixes (in which case, it
would be nice to know what the broken thing is) or that there's an
opportunity for further optimization of the new patch (either now or
later, depending on how much work we're talking about). Just to be
clear, it makes perfect sense that the new system is faster in the
contended case, and the benchmark numbers are very impressive. It's a
lot harder to understand why it's not slower in the uncontended case.

> Not sure why its odd to have backends do one thing and auxiliaries do
> another. The whole point of auxiliary processes is that they do a
> specific thing different to normal backends. Anyway, the important
> thing is to have auxiliary processes be independent of each other as
> much as possible, which simplifies error handling and state logic in
> the postmaster.

Yeah, I guess the shutdown sequence could get a bit complex if we try
to make everyone go through the WAL writer all the time. But I wonder
if we could rejiggering things somehow so that everything goes through
WAL writer if its dead.

+ * Wait for group commit, and then return true, if group commit serviced the
+ * request (not necessarily successfully). Otherwise, return false
and fastpath
+ * out of here, allowing the backend to make alternative arrangements to flush
+ * its WAL in a more granular fashion. This can happen because the record that
+ * the backend requests to have flushed in so far into the future
that to group
+ * commit it would

This trails off in mid-sentence.

+ if (delta > XLOG_SEG_SIZE * CheckPointSegments ||
+ !ProcGlobal->groupCommitAvailable)

That seems like a gigantic value. I would think that we'd want to
forget about group commit any time we're behind by more than one
segment (maybe less).

+ if (ProcDiePending || QueryCancelPending)
+ {
+ GroupCommitCancelWait();
+
+ /*
+ * Let out-of-line interrupt handler take it
from here. Cannot raise
+ * an error here because we're in an enclosing
critical block.
+ */
+ break;
+ }

Presumably in this case we need to return false, not true. TBH, I
feel like this error handling is quite fragile, a fragility it
inherits from sync rep, on which it is based. I am not sure I have a
better idea, though.

+ /*
+ * Backends may still be waiting to have their transactions
committed in batch,
+ * but calling this function prevents any backend from using group commit, and
+ * thus from having a dependency on the WAL Writer.
+ *
+ * When the WAL Writer finishes servicing those remaining backends,
it will not
+ * have any additional work to do, and may shutdown.
+ */
+ void
+ GroupCommitDisable(void)
+ {
+ ProcGlobal->groupCommitAvailable = false;
+ }

I can't imagine that this is safe against memory ordering issues.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: Group commit, revised at 2012-01-18 22:38:30 from Simon Riggs

Responses

Re: Group commit, revised at 2012-01-20 03:46:54 from Peter Geoghegan

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Marti Raudsepp	2012-01-19 17:46:13	Re: Simulating Clog Contention
Previous Message	David E. Wheeler	2012-01-19 17:12:02	Re: Inline Extension