Skip site navigation (1) Skip section navigation (2)

Re: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs
Date: 2011-08-08 16:39:32
Message-ID: 27512.1312821572@sss.pgh.pa.us (view raw or flat)
Thread:
Lists: pgsql-hackers
Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sun, Aug 7, 2011 at 1:47 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Any protocol of that sort has *obviously* got a race condition between
>> changes of the latch state and changes of the separate flag state, if run
>> in a weak-memory-ordering machine.

> On the flip side, I'm not sure that anyone ever really expected that a
> latch could be safely used this way.  Normally, I'd expect the flag to
> be some piece of state protected by an LWLock, and I think that ought
> to be OK provided that the lwlock is released before setting or
> checking the latch.  Maybe we should try to document the contract here
> a little better; I think it's just that there must be a full memory
> barrier (such as LWLockRelease) in both processes involved in the
> interaction.

Heikki argued the same thing in
http://archives.postgresql.org/message-id/4CEA5A0F.1030602@enterprisedb.com
but I don't think anyone has actually done the legwork to verify that
current uses are safe.  So [ ... warms up grep ... ]

Currrent callers of WaitLatch(OrSocket):

XLogPageRead waits on XLogCtl->recoveryWakeupLatch

	latch is set by WakeupRecovery, which is called by process'
	own signal handlers (OK) and XLogWalRcvFlush.  The latter is OK
	because there's a spinlock protecting the transmitted data.

pgarch_MainLoop waits on mainloop_latch

	non-issue because latch is process-local

WalSndLoop waits on MyWalSnd->latch

	latch is set by signal handlers and WalSndWakeup.  The latter is
	OK because it's called right after XLogFlush (which certainly
	locks whatever it touches) or a spinlock release.

SyncRepWaitForLSN waits on MyProc->waitLatch

	latch is set by signal handlers and SyncRepWakeQueue.  The
	latter appears unsafe at first glance, because it sets the
	shared variable (thisproc->syncRepState) and immediately
	sets the latch.  However, the receiver does this curious dance:

		syncRepState = MyProc->syncRepState;
		if (syncRepState == SYNC_REP_WAITING)
		{
			LWLockAcquire(SyncRepLock, LW_SHARED);
			syncRepState = MyProc->syncRepState;
			LWLockRelease(SyncRepLock);
		}

	which I believe makes it safe, though rather underdocumented;
	if a race does occur, the receiver will acquire the lock and
	recheck, and the lock acquisition will block until the caller of
	SyncRepWakeQueue has released SyncRepLock, and that release
	will flush any pending writes to shared memory.

Conclusions:

(1) 9.1 does not have a problem of this sort.

(2) Breathing hard on any of this code could break it.

(3) I'm not going to hold still for not inserting a memory barrier
into SetLatch, once we have the code.  Unsubstantiated performance
worries are not a sufficient argument not to --- as a wise man once
said, you can make the code arbitrarily fast if it doesn't have to
give the right answer.

(4) In the meantime, we'd better document that it's caller's
responsibility to ensure that the flag variable is adequately
protected by locking.  I think SyncRepWaitForLSN could use a warning
comment, too.  I will go do that.

			regards, tom lane

In response to

Responses

pgsql-hackers by date

Next:From: Heikki LinnakangasDate: 2011-08-08 16:45:07
Subject: Re: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs
Previous:From: Alvaro HerreraDate: 2011-08-08 16:34:18
Subject: Re: [RFC] Common object property boards

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group