Re: Deadlock in multiple CIC.

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Deadlock in multiple CIC.
Date: 2018-04-17 16:23:18
Message-ID: 6409.1523982198@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> So we can now refine the problem statement to "SnapshotResetXmin isn't
> doing what it's supposed to". No idea why yet. 9.4 is using a simple
> RegisteredSnapshots counter which 9.5 has replaced with a pairing heap,
> so you'd think the newer code would be *more* likely to have bugs...

It's still not entirely clear what's happening on okapi, but in the
meantime I've thought of an easily-reproducible way to cause similar
failures in any branch. That is to run CREATE INDEX CONCURRENTLY
with default_transaction_isolation = serializable. Then, snapmgr.c
will set up a transaction snapshot (actually identical to the
"reference snapshot" used by DefineIndex), and that will not get
released, so the process's xmin doesn't get cleared, and we have
a deadlock hazard.
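
As a rough illustration of why an uncleared xmin is a deadlock hazard, here is
a toy model in Python. It is not PostgreSQL code: ToySession, waits_for and
has_deadlock are hypothetical names, and the waiting rule is simplified to
"wait for every other session still advertising an xmin at or below my
reference snapshot's xmin".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToySession:
    """Hypothetical stand-in for a backend running CREATE INDEX CONCURRENTLY."""
    name: str
    advertised_xmin: Optional[int]  # None = no snapshot held, xmin cleared
    limit_xmin: int                 # xmin of this session's reference snapshot


def waits_for(waiter: ToySession, sessions: List[ToySession]) -> List[str]:
    """Names of other sessions still advertising an xmin <= waiter.limit_xmin."""
    return [
        s.name
        for s in sessions
        if s is not waiter
        and s.advertised_xmin is not None
        and s.advertised_xmin <= waiter.limit_xmin
    ]


def has_deadlock(sessions: List[ToySession]) -> bool:
    """True if the wait-for graph among the sessions contains a cycle."""
    graph: Dict[str, List[str]] = {s.name: waits_for(s, sessions) for s in sessions}
    visiting, done = set(), set()

    def visit(node: str) -> bool:
        if node in done:
            return False
        if node in visiting:
            return True  # back edge: we found a cycle
        visiting.add(node)
        if any(visit(nxt) for nxt in graph[node]):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(name) for name in graph)


# Two concurrent CICs whose xmin was never cleared wait on each other.
stuck = [ToySession("cic1", 100, 100), ToySession("cic2", 100, 100)]
# The same pair after both have cleared their xmin before waiting.
clean = [ToySession("cic1", None, 100), ToySession("cic2", None, 100)]
print(has_deadlock(stuck), has_deadlock(clean))
```

In this simplified picture, each session that keeps advertising an xmin looks
like an "older snapshot" to the other one, so both wait and neither proceeds.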

I experimented with running the isolation tests under "alter system set
default_transaction_isolation to serializable". Oddly, multiple-cic
tends to not fail that way for me, though if I reduce the
isolation_schedule file to contain just that one test, it fails nine
times out of ten. Leftover activity from the previous tests must be
messing up the timing somehow. Anyway, the problem is definitely real.
(A couple of the other isolation tests do fail reliably under this
scenario; is it worth hardening them?)

I thought for a bit about trying to force C.I.C.'s transactions to
be run with a lower transaction isolation level, but that seems messy
and I'm not very sure it wouldn't have bad side-effects. A much simpler
fix is to just start yet another transaction before waiting, as in the attached
proposed patch. (With the transaction restart, I feel sufficiently
confident that there should be no open snapshots that it seems okay
to put in the Assert I was previously afraid to add.)
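
The effect of the transaction restart can be sketched with another toy model
in Python. This only illustrates the idea and is not the attached patch:
ToyBackend and its methods are hypothetical, and the snapshot bookkeeping is
reduced to a plain list, assuming that under serializable isolation a
transaction snapshot stays registered until the transaction ends.

```python
from typing import List, Optional, Tuple


class ToyBackend:
    """Hypothetical model of one backend's registered snapshots."""

    def __init__(self, serializable: bool) -> None:
        self.serializable = serializable
        self.snapshots: List[Tuple[str, int]] = []  # (kind, xmin)

    def start_transaction(self) -> None:
        # A fresh transaction has taken no snapshot yet.
        self.snapshots = []

    def commit_transaction(self) -> None:
        # Ending the transaction releases every snapshot it held.
        self.snapshots = []

    def take_reference_snapshot(self, xmin: int) -> None:
        # Assumption: under serializable isolation, the first snapshot also
        # becomes the transaction snapshot and lives as long as the transaction.
        if self.serializable and not self.snapshots:
            self.snapshots.append(("transaction", xmin))
        self.snapshots.append(("reference", xmin))

    def release_reference_snapshot(self) -> None:
        self.snapshots = [s for s in self.snapshots if s[0] != "reference"]

    @property
    def advertised_xmin(self) -> Optional[int]:
        # The backend advertises the oldest xmin among its live snapshots.
        return min((xmin for _, xmin in self.snapshots), default=None)


backend = ToyBackend(serializable=True)
backend.start_transaction()
backend.take_reference_snapshot(100)
backend.release_reference_snapshot()
xmin_without_restart = backend.advertised_xmin  # transaction snapshot remains

backend.commit_transaction()
backend.start_transaction()
xmin_after_restart = backend.advertised_xmin    # nothing is held any more
assert xmin_after_restart is None               # analogous to the new Assert
```

In the model, releasing the reference snapshot alone leaves the transaction
snapshot in place, while committing and starting a new transaction leaves
nothing registered, which is the condition the added Assert is meant to check
before the wait begins.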

I don't know whether this would make okapi's problem go away.
But it's seeming somewhat likely at this point that we're hitting
a weird compiler misoptimization there, and this might dodge it.
In any case this is demonstrably fixing a problem.

regards, tom lane

Attachment: fix-serializable-cic-deadlock.patch (text/x-diff, 1.3 KB)
