Re: Group commit, revised

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Group commit, revised
Date: 2012-01-20 03:46:54
Message-ID: CAEYLb_UcWD+EOYkat=ZF-SyZP=bhHJ2ciZTeVn6=K3RWvje6RQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 19 January 2012 17:40, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I don't know what you mean by this.  I think removing wal_writer_delay
> is premature, because I think it still may have some utility, and the
> patch removes it.  That's a separate change that should be factored
> out of this patch and discussed separately.

FWIW, I don't really care too much if we keep wal_writer_delay,
provided it is only used in place of the patch's
WALWRITER_NORMAL_TIMEOUT constant. I will note in passing that I doubt
that the effect with asynchronous commits and hint bits is pronounced
enough to have ever transferred through to someone making a practical
recommendation to reduce wal_writer_delay to ameliorate clog
contention.

>> When we did sync rep it made sense to have the WALSender do the work
>> and for others to just wait. It would be quite strange to require a
>> different design for essentially the same thing for normal commits and
>> WAL flushes to local disk. I should mention the original proposal for
>> streaming replication had each backend send data to standby
>> independently and that was recognised as a bad idea after some time.
>> Same for sync rep also.
>
> I don't think those cases are directly comparable.  SR is talking to
> another machine, and I can't imagine that there is a terribly
> convenient or portable way for every backend that needs one to get a
> hold of the file descriptor for the socket.  Even if it could, the
> data is sent as a stream, so if multiple backends sent to the same
> file descriptor you'd have to have some locking to prevent messages
> from getting interleaved.  Or else you could have multiple
> connections, or use UDP, but that gets rather complex.  Anyway, none
> of this is an issue for file I/O: anybody can open the file, and if
> two backends write data at different offsets at the same time - or the
> same data at the same offset at the same time - there's no problem.
> So the fact that it wasn't a good idea for SR doesn't convince me that
> it's also a bad idea here.
>
> On the other hand, I'm not saying we *should* do it that way, either -
> i.e. I am not trying to "require a different design" just because it's
> fun to make people change things.  Rather, I am trying to figure out
> whether the design you've chosen is in fact the best one, and part of
> that involves reasoning about why it might or might not be.  There are
> obvious reasons to think that having process A kick process B and go
> to sleep, then have process B do some work and wake up process A might
> be less efficient than having process A just do the work itself, in
> the uncontended case.  The fact that that isn't happening seems
> awfully strange to me; it's hard to understand why 3 system calls are
> faster than one.  That suggests that either the current system is
> badly broken in some way that this patch fixes (in which case, it
> would be nice to know what the broken thing is) or that there's an
> opportunity for further optimization of the new patch (either now or
> later, depending on how much work we're talking about).  Just to be
> clear, it makes perfect sense that the new system is faster in the
> contended case, and the benchmark numbers are very impressive.  It's a
> lot harder to understand why it's not slower in the uncontended case.
>
>> Not sure why its odd to have backends do one thing and auxiliaries do
>> another. The whole point of auxiliary processes is that they do a
>> specific thing different to normal backends. Anyway, the important
>> thing is to have auxiliary processes be independent of each other as
>> much as possible, which simplifies error handling and state logic in
>> the postmaster.
>
> Yeah, I guess the shutdown sequence could get a bit complex if we try
> to make everyone go through the WAL writer all the time.  But I wonder
> if we could rejiggering things somehow so that everything goes through
> WAL writer if its dead.

I'm not sure what you mean by this last bit. It sounds a bit hazardous.

My problem with nominating a backend to the status of an auxiliary is
that no matter what way you cut it, it increases the failure surface
area, so to speak.

I'm not sure why Heikki thinks that it follows that having a
dependency on some backend is simpler than having one on an auxiliary
process. As to the question of IPC and context switch overhead, I'd
speculate that protecting access to a data structure with book keeping
information regarding which backend is currently the driver and so on
might imply considerably more overhead than IPC and context switching.
It might also be that having WAL Writer almost solely responsible for
this might facilitate more effective use of CPU cache.

On most modern architectures, system calls don't actually cause a full
context switch. The kernel can just enter a "mode switch" (go from
user mode to kernel mode, and then back to user mode). You can observe
this effect with vmstat. That's how 3 system calls might not look much
more expensive than 1.

> +       if (delta > XLOG_SEG_SIZE * CheckPointSegments ||
> +                       !ProcGlobal->groupCommitAvailable)
>
> That seems like a gigantic value.  I would think that we'd want to
> forget about group commit any time we're behind by more than one
> segment (maybe less).

I'm sure that you're right - I myself described it as horrible in my
original mail. I think that whatever value we set this to ought to be
well reasoned. Right now, your magic number doesn't seem much better
than my bogus heuristic (which only ever served as a placeholder
implementation that hinted at a well-principled formula). What's your
basis for suggesting that that limit would always be close to optimal?

Once again, I ask the question: what does a big flusher look like?

> +               if (ProcDiePending || QueryCancelPending)
> +               {
> +                       GroupCommitCancelWait();
> +
> +                       /*
> +                        * Let out-of-line interrupt handler take it
> from here. Cannot raise
> +                        * an error here because we're in an enclosing
> critical block.
> +                        */
> +                       break;
> +               }
>
> Presumably in this case we need to return false, not true.

No, that is not an error. As described in the comment above that
function, the return value indicates if group commit serviced the
request, not necessarily successfully. If we were to return false,
we'd be acting as if group commit just didn't want to flush that LSN,
necessitating making alternative arrangements (calling XLogWrite()).
However, we should just hurry up and get to the interrupt handler to
error and report failure to the client (though not before executing
the code immediately before "return true" at the end of the function).
Maybe you could argue that it would be better to do that anyway, but I
think that it could mask problems and unnecessarily complicate
matters.

> TBH, I feel like this error handling is quite fragile, a fragility it
> inherits from sync rep, on which it is based.  I am not sure I have a
> better idea, though.

I agree with this analysis. However, since no better alternative
suggests itself, we may find that an XXX comment will go a long way.

> + void
> + GroupCommitDisable(void)
> + {
> +       ProcGlobal->groupCommitAvailable = false;
> + }
>
> I can't imagine that this is safe against memory ordering issues.

That had occurred to me. Strictly speaking, it doesn't necessarily
have to be, provided that some architecture with weak memory ordering
is not inclined to take an inordinate amount of time, or even forever,
to notice that the variable has been set to false. WAL Writer calls
this to "turn the tap off" to allow it to finish its group commit
backlog (allowing those backends to shutdown) and shutdown itself, but
it doesn't actually assume that turning the tap off will be
immediately effective.

However, now that I think about it, even if my assumption about the
behaviour of all machines regarding memory ordering there doesn't fall
down, it does seem possible that GroupCommitShutdownReady() could
return true, but that status could immediately change because some
backend on Alpha or something didn't see that the flag was set, so
your point is well taken.

By the way, how goes the introduction of memory barriers to Postgres?
I'm aware that you committed an implementation back in September, but
am not sure what the plan is as regards using them in 9.2.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Steve Singer 2012-01-20 04:01:49 Re: Online base backup from the hot-standby
Previous Message Fujii Masao 2012-01-20 03:43:17 Re: WAL Restore process during recovery