Re: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs
Date: 2011-08-08 12:47:28
Message-ID: CA+TgmobyqhvXtizQ7ugR7M27fdtto78LB4c-x407KVVH=SxJug@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Aug 7, 2011 at 1:47 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Any protocol of that sort has *obviously* got a race condition between
> changes of the latch state and changes of the separate flag state, if run
> in a weak-memory-ordering machine.
>
> At least on the hardware I'm testing, it seems that the key requirement
> for a failure to be seen is that there's a lot of contention for the cache
> line containing the flag, but not for the cache line containing the latch.
> This allows the write to the flag to take longer to propagate to main
> memory than the write to the latch.

This is interesting research, especially because it provides sobering
evidence of how easily a real problem could be overlooked in testing.
If it took several minutes for this to fail on an 8-CPU system, it's
easy to imagine a less egregious example sitting in our code-base for
years undetected. (Indeed, I think we probably have a few of those
already...)

On the flip side, I'm not sure that anyone ever really expected that a
latch could be safely used this way. Normally, I'd expect the flag to
be some piece of state protected by an LWLock, and I think that ought
to be OK provided that the lwlock is released before setting or
checking the latch. Maybe we should try to document the contract here
a little better; I think it's just that there must be a full memory
barrier (such as LWLockRelease) in both processes involved in the
interaction.

> I'm not sure whether we need to do anything about this for 9.1; it may be
> that none of the existing uses of latches are subject to the kind of
> contention that would create a risk.  But I'm entirely convinced that we
> need to insert some memory-barrier instructions into the latch code before
> we expand our uses of latches much more.  Since we were already finding
> that memory barriers will be needed for performance reasons elsewhere,
> I think it's time to bite the bullet and develop that infrastructure.

I've been thinking about this too and actually went so far as to do
some research and put together something that I hope covers most of
the interesting cases. The attached patch is pretty much entirely
untested, but reflects my present belief about how things ought to
work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
barrier-v1.patch application/octet-stream 6.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2011-08-08 13:20:09 Re: vacuumlo patch
Previous Message Anssi Kääriäinen 2011-08-08 11:05:17 Re: Transient plans versus the SPI API