Re: delay starting WAL receiver

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: delay starting WAL receiver
Date: 2023-01-11 05:47:54
Message-ID: 20230111054754.GA1622234@nathanxps13
Lists: pgsql-hackers

On Wed, Jan 11, 2023 at 05:20:38PM +1300, Thomas Munro wrote:
> Is the problem here that SIGCHLD is processed ...
>
> PG_SETMASK(&UnBlockSig); <--- here?
>
> selres = select(nSockets, &rmask, NULL, NULL, &timeout);
>
> Meanwhile the SIGCHLD handler code says:
>
> * Was it the wal receiver? If exit status is zero (normal) or one
> * (FATAL exit), we assume everything is all right just like normal
> * backends. (If we need a new wal receiver, we'll start one at the
> * next iteration of the postmaster's main loop.)
>
> ... which is true, but that won't be reached for a while in this case
> if the timeout has already been set to 60s. Your patch makes that
> 100ms, in that case, a time delay that by now attracts my attention
> like a red rag to a bull (I don't know why you didn't make it 0).

I think this is right. At the very least, it seems consistent with my
observations.

> I'm not sure, but if I got that right, then I think the whole problem
> might automatically go away with CF #4032. The SIGCHLD processing
> code will run not when signals are unblocked before select() (that is
> gone), but instead *after* the event loop wakes up with WL_LATCH_SET,
> and runs the handler code in the regular user context before dropping
> through to the rest of the main loop.

Yeah, with those patches, the problem goes away. IIUC the key part is that
the postmaster's latch gets set when SIGCHLD is received, so even if
SIGUSR1 and SIGCHLD are processed out of order, WalReceiverPID gets cleared
and we try to restart it shortly thereafter. I find this much easier to
reason about.

I'll go ahead and withdraw this patch from the commitfest. Thanks for
chiming in.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
