Re: Logical replication: stuck spinlock at ReplicationSlotRelease

From: Tom Lane <tgl@sss.pgh.pa.us>
To: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
Cc: Andres Freund <andres@anarazel.de>, Alvaro Herrera <alvherre@2ndquadrant.com>, Albe Laurenz <laurenz.albe@wien.gv.at>, "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
Subject: Re: Logical replication: stuck spinlock at ReplicationSlotRelease
Date: 2017-06-24 02:50:51
Message-ID: 25203.1498272651@sss.pgh.pa.us
Lists: pgsql-hackers

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> Do you want to take a look at moving those elog calls around a bit? That
> should do it.

It would be a good idea to have some clarity on *why* that should do it.
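
For concreteness, here is a minimal sketch of the pattern at issue and of
the proposed rearrangement. (Simplified: the mutex and candidate_restart_lsn
fields are real ReplicationSlot members, but this is not the actual
logical.c code.)

    /*
     * Hazardous: elog() is not straight-line code.  It may try to send
     * the message to the client, notice the connection is gone, and exit
     * via FATAL, with the spinlock still held.
     */
    SpinLockAcquire(&slot->mutex);
    slot->candidate_restart_lsn = restart_lsn;
    elog(DEBUG1, "updated restart LSN");    /* can error out under the lock */
    SpinLockRelease(&slot->mutex);

    /*
     * Rearranged: only straight-line updates happen under the lock, and
     * the elog() moves outside it.
     */
    SpinLockAcquire(&slot->mutex);
    slot->candidate_restart_lsn = restart_lsn;
    SpinLockRelease(&slot->mutex);
    elog(DEBUG1, "updated restart LSN");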

Looking at the original report's log, but without having actually
reproduced the problem, I guess what is happening is this:

1. Subscription worker process (23117) gets a duplicate key conflict while
trying to apply an update, and in consequence it exits. (Is that supposed
to happen?)

2. Publication server process (23124) doesn't notice client connection
loss right away. By chance, the next thing it tries to send to the client
is the debug output from LogicalIncreaseRestartDecodingForSlot. Then it
detects loss of connection (at 2017-06-21 14:55:12.033) and FATAL's out.
But since the spinlock infrastructure does no tracking of held locks,
the exiting process doesn't know it is still holding the replication
slot's mutex.

3. Process exit cleanup does know that it's supposed to release the
replication slot, so it tries to take the mutex spinlock ... again.
Eventually that times out and we get the "stuck spinlock" panic.
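
In code form, steps 2 and 3 combine roughly like this (a simplified
sketch of the cleanup path, not the exact slot.c code):

    /*
     * The FATAL exit in step 2 happens with slot->mutex still held.
     * Process-exit cleanup then runs ReplicationSlotRelease(), which
     * tries to take the same spinlock.
     */
    void
    ReplicationSlotRelease(void)
    {
        ReplicationSlot *slot = MyReplicationSlot;

        SpinLockAcquire(&slot->mutex);  /* spins: this process already holds it */
        slot->active_pid = 0;
        SpinLockRelease(&slot->mutex);

        MyReplicationSlot = NULL;
    }

Since spinlocks record no owner, s_lock() cannot tell that the waiter is
blocked on itself; it just retries with backoff for about a minute and
then PANICs with "stuck spinlock detected".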

All correct so far?

So, okay, the proximate cause of the crash is a blatant violation of the
rule that spinlocks may only be held across straight-line code segments.
But I'm wondering about the client exit having occurred in the first
place. Why is that, and how would one ever recover? It sure looks
like this isn't the first subscription worker process that has tried
and failed to apply the update. If our attitude towards this situation is
that it's okay to fork-bomb your server with worker processes continually
respawning and making no progress, well, I don't think that's good enough.

regards, tom lane
