Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?
Date: 2022-04-11 22:33:28
Message-ID: CA+hUKGL7ZFiX5yrbTRSjwH_x=2m40cobGewxu+XBKu0Dbh5N-Q@mail.gmail.com
Lists: pgsql-hackers

On Sun, Apr 10, 2022 at 11:00 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2022-04-09 14:39:16 -0700, Andres Freund wrote:
> > On 2022-04-09 17:00:41 -0400, Tom Lane wrote:
> > > Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> > > > Unlike most "procsignal" handler routines, RecoveryConflictInterrupt()
> > > > doesn't just set a sig_atomic_t flag and poke the latch. Is the extra
> > > > stuff it does safe? For example, is this call stack OK (to pick one
> > > > that jumps out, but not the only one)?
> > >
> > > > procsignal_sigusr1_handler
> > > > -> RecoveryConflictInterrupt
> > > > -> HoldingBufferPinThatDelaysRecovery
> > > > -> GetPrivateRefCount
> > > > -> GetPrivateRefCountEntry
> > > > -> hash_search(...hash table that might be in the middle of an update...)
> > >
> > > Ugh. That one was safe before somebody decided we needed a hash table
> > > for buffer refcounts, but it's surely not safe now.
> >
> > Mea culpa. This is 4b4b680c3d6d - from 2014.
>
> Whoa. There's way worse: StandbyTimeoutHandler() calls
> SendRecoveryConflictWithBufferPin(), which calls CancelDBBackends(), which
> acquires lwlocks etc.
>
> Which very plausibly is the cause for the issue I'm investigating in
> https://www.postgresql.org/message-id/20220409220054.fqn5arvbeesmxdg5%40alap3.anarazel.de

Huh. I wouldn't have started a separate thread for this if I'd
realised I was getting close to the cause of the CI failure... I
thought this was an incidental observation. Anyway, I made a first
attempt at fixing this SIGUSR1 problem (I think Andres is looking at
the SIGALRM problem in the other thread).

Instead of bothering to create N different XXXPending variables for
the different conflict "reasons", I used an array. Other than that,
it's much like existing examples.

The existing use of the global variable RecoveryConflictReason seems a
little woolly. Doesn't it get clobbered every time a signal arrives,
even if we determine that there is no conflict? Not sure why that's
OK, but anyway, this patch always sets it together with
RecoveryConflictPending = true.
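
To illustrate the shape of that approach (this is only a rough sketch, not
the attached patch): the handler side stores to sig_atomic_t flags, one
array slot per reason plus a summary flag, and the real work moves to the
interrupt-processing path, where it is safe to touch hash tables or take
locks. All names below (ConflictReason, conflict_pending_reasons,
handle_conflict_signal, take_pending_conflicts) are hypothetical and chosen
for the example; the real code would also set the latch in the handler.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical reasons, for illustration only. */
typedef enum ConflictReason
{
    CONFLICT_LOCK,
    CONFLICT_SNAPSHOT,
    CONFLICT_BUFFERPIN,
    NUM_CONFLICT_REASONS
} ConflictReason;

/* One summary flag plus one flag per reason, instead of N separate variables. */
static volatile sig_atomic_t conflict_pending = 0;
static volatile sig_atomic_t conflict_pending_reasons[NUM_CONFLICT_REASONS];

/*
 * Signal-handler side: only plain stores to sig_atomic_t flags.
 * No hash table lookups, no locks, no memory allocation.
 */
static void
handle_conflict_signal(ConflictReason reason)
{
    conflict_pending_reasons[reason] = 1;
    conflict_pending = 1;
}

/*
 * Main-loop side: collect and clear the pending reasons.  The summary flag
 * is cleared before the scan, so a signal arriving during the scan sets it
 * again and is picked up by the next call.  Returns the number of reasons
 * found; out[r] is true for each reason that was pending.
 */
static int
take_pending_conflicts(bool out[NUM_CONFLICT_REASONS])
{
    int         found = 0;

    for (int r = 0; r < NUM_CONFLICT_REASONS; r++)
        out[r] = false;

    if (!conflict_pending)
        return 0;
    conflict_pending = 0;

    for (int r = 0; r < NUM_CONFLICT_REASONS; r++)
    {
        if (conflict_pending_reasons[r])
        {
            conflict_pending_reasons[r] = 0;
            out[r] = true;
            found++;
        }
    }
    return found;
}
```

In this sketch the caller of take_pending_conflicts() would decide, per
reason, whether there really is a conflict, and only then record the reason
and act on it, so nothing gets clobbered merely because a signal arrived.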

Attachment: 0001-Fix-recovery-conflict-SIGUSR1-handling.patch (text/x-patch, 7.2 KB)
