Re: [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Andreas Seltenreich <seltenreich(at)gmx(dot)de>, pgsql-hackers(at)postgresql(dot)org, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Subject: Re: [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM
Date: 2017-10-03 16:30:02
Message-ID: 22735.1507048202@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> Tom Lane wrote:
>> Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
>>> I think the latch is only used locally. Seems that it was only put in
>>> shmem to avoid a separate variable ...

>> Hm, I'm strongly tempted to move it to a separate static variable then.

Oh, wait, look at

/*
* Wake up the walreceiver main loop.
*
* This is called by the startup process whenever interesting xlog records
* are applied, so that walreceiver can check if it needs to send an apply
* notification back to the master which may be waiting in a COMMIT with
* synchronous_commit = remote_apply.
*/
void
WalRcvForceReply(void)
{
WalRcv->force_reply = true;
if (WalRcv->latch)
SetLatch(WalRcv->latch);
}

So that's trouble waiting to happen, for sure. At the very least we
need to do a single fetch of WalRcv->latch, not two. I wonder whether
even that is sufficient, though: this coding requires an atomic fetch of
a pointer, which is something we generally don't assume to be safe.

I'm inclined to think that it'd be a good idea to move the set and
clear of the latch field into the nearby spinlock critical sections,
and then change WalRcvForceReply to look like

void
WalRcvForceReply(void)
{
Latch *latch;

WalRcv->force_reply = true;
/* fetching the latch pointer might not be atomic, so use spinlock */
SpinLockAcquire(&WalRcv->mutex);
latch = WalRcv->latch;
SpinLockRelease(&WalRcv->mutex);
if (latch)
SetLatch(latch);
}

This still leaves us with a hazard of applying SetLatch to the latch
of a PGPROC that is no longer the walreceiver's, but I think that's
okay (and we have the same problem elsewhere). At worst it leads to
an extra wakeup of the next process using that PGPROC.

I'm also thinking that the force_reply flag needs to be sig_atomic_t,
not bool, because bool store/fetch isn't necessarily atomic on all
platforms.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dang Minh Huong 2017-10-03 16:31:18 Re: list of credits for release notes
Previous Message Andres Freund 2017-10-03 16:23:17 Re: SendRowDescriptionMessage() is slow for queries with a lot of columns