Re: xlc atomics

From: Noah Misch <noah(at)leadboat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>
Subject: Re: xlc atomics
Date: 2016-04-24 01:54:07
Message-ID: 20160424015407.GD2084318@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 15, 2016 at 02:02:41PM -0500, Tom Lane wrote:
> Noah Misch <noah(at)leadboat(dot)com> writes:
> > That commit (0d32d2e) permitted things to compile and usually pass tests, but
> > I missed the synchronization bug. Since 2015-10-01, the buildfarm has seen
> > sixteen duplicate-catalog-OID failures.
>
> I'd been wondering about those ...
>
> > These suggested OidGenLock wasn't doing its job. I've seen similar symptoms
> > around WALInsertLocks with "IBM XL C/C++ for Linux, V13.1.2 (5725-C73,
> > 5765-J08)" for ppc64le. The problem is generic-xlc.h
> > pg_atomic_compare_exchange_u32_impl() issuing __isync() before
> > __compare_and_swap(). __isync() shall follow __compare_and_swap(); see our
> > own s_lock.h, its references, and other projects' usage:
>
> Nice catch!
>
> > This patch's test case would have failed about half the time under today's
> > generic-xlc.h. Fast machines run it in about 1s. A test that detects the bug
> > 99% of the time would run far longer, hence this compromise.
>
> Sounds like a reasonable compromise to me, although I wonder about the
> value of it if we stick it into pgbench's TAP tests. How many of the
> slower buildfarm members are running the TAP tests? Certainly mine are
> not.

I missed a second synchronization bug in generic-xlc.h, but the pgbench test
suite caught it promptly after commit 008608b. Buildfarm members mandrill and
hornet failed[1] or deadlocked within two runs. The bug is that the
pg_atomic_compare_exchange_*() specifications grant "full barrier semantics",
but generic-xlc.h provided only the semantics of an acquire barrier. Commit
008608b exposed that older problem; its LWLockWaitListUnlock() relies on
pg_atomic_compare_exchange_u32() (via pg_atomic_fetch_and_u32()) for release
barrier semantics.

The fix is to issue "sync" before each compare-and-swap, in addition to the
"isync" after it. This is consistent with the corresponding GCC atomics. The
pgbench test suite survived dozens of runs so patched.

[1] http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2016-04-11%2003%3A14%3A13

Attachment Content-Type Size
full-barrier-xlc-v1.patch text/plain 1.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2016-04-24 02:51:17 Re: Suspicious behaviour on applying XLOG_HEAP2_VISIBLE.
Previous Message Noah Misch 2016-04-23 23:11:36 Re: pg_dump dump catalog ACLs