Re: Deadlock in multiple CIC.

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Deadlock in multiple CIC.
Date: 2018-04-15 23:07:40
Message-ID: 6744.1523833660@sss.pgh.pa.us
Lists: pgsql-hackers

Awhile back, Alvaro Herrera wrote:
>> Pushed to all affected branches, along with a somewhat lame
>> isolationtester test for the condition (since we've already broken this
>> twice and not noticed for long).

> Buildfarm member okapi just failed this test in 9.4:

okapi has continued to fail that test, not 100% of the time but much
more often than not ... but only in 9.4. And no other animals have
shown it at all. So what to make of that?

Noting that okapi uses a pretty old icc version running at a high -O
level, we could dismiss it as probably-a-compiler-bug. But that theory
doesn't really account for the fact that it sometimes succeeds.

Another theory, noting that 9.5 and later have memory barriers in S_UNLOCK
which 9.4 lacks, is that the reason 9.4 has a problem is lack of a memory
barrier between SnapshotResetXmin and GetCurrentVirtualXIDs, thus allowing
both processes to observe the other's xmin as still nonzero given the
right timing. This seems like a stretch, because really the latter
function's LWLockAcquire on ProcArrayLock ought to be enough to serialize
things. But there has to be *something* different between 9.4 and all the
later branches, and the barrier stuff sure looks like it's in the right
neighborhood.
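
To make the suspected interleaving concrete, here is a minimal standalone sketch in C11 atomics. It is not PostgreSQL code: advertised_xmin and other_still_holds_snapshot are hypothetical names, the array merely stands in for the two backends' advertised xmin values, and 0 plays the role of InvalidTransactionId. Each session clears its own slot and then reads the other's, which is roughly the SnapshotResetXmin-then-GetCurrentVirtualXIDs sequence under this theory.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the xmin each of two backends advertises.
 * 0 plays the role of InvalidTransactionId.  This is an illustration,
 * not PostgreSQL's actual shared-memory layout.
 */
static _Atomic uint32_t advertised_xmin[2];

/*
 * Model of one session: clear our own xmin, then check whether the
 * other session still appears to hold a snapshot.
 *
 * With use_barrier = false, the store and the load are both relaxed,
 * so the hardware may let the load complete before the store becomes
 * visible to the other thread (store buffering).
 *
 * With use_barrier = true, a full fence sits between the store and the
 * load.  If both sessions run with the fence, they cannot both return
 * true.
 */
static bool
other_still_holds_snapshot(int me, bool use_barrier)
{
    assert(me == 0 || me == 1);

    /* Analogous to resetting our advertised xmin. */
    atomic_store_explicit(&advertised_xmin[me], 0, memory_order_relaxed);

    if (use_barrier)
        atomic_thread_fence(memory_order_seq_cst);

    /* Analogous to scanning the other backend's xmin. */
    return atomic_load_explicit(&advertised_xmin[1 - me],
                                memory_order_relaxed) != 0;
}
```

If two threads call this concurrently with use_barrier = false, both may see the other's slot as still nonzero, so each would wait for the other. The sketch leaves out locking entirely, so it says nothing about whether the ProcArrayLock acquisition already prevents this in the real code.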

As an investigative measure, I propose that we insert

Assert(MyPgXact->xmin == InvalidTransactionId);

into 9.4's DefineIndex, just after its InvalidateCatalogSnapshot call.
I don't want to leave that there permanently, because it's not clear to me
that there are no legitimate cases where a backend would have extra
snapshots active during CREATE INDEX CONCURRENTLY --- but we seem to get
through 9.4's regression tests with it, and it would quickly confirm or
deny whether okapi is failing because it somehow has an extra snapshot.

Assuming that that doesn't show anything, I'm inclined to think that
the next step should be to add a pg_memory_barrier() call to
SnapshotResetXmin (again only in the 9.4 branch), and see if that helps.

regards, tom lane
