Re: Weird failure with latches in curculio on v15

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Fujii Masao <fujii(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Weird failure with latches in curculio on v15
Date: 2023-02-01 10:55:14
Message-ID: 20230201105514.rsjl4bnhb65giyvo@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
> My database off assertion failures has four like that, all 15 and master:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-02-01%2001:05:17
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-01-11%2011:16:40
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2022-11-22%2012:19:21
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2022-11-17%2021:47:02
>
> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
> handler which is allowed to call that while in_restore_command is
> true.

Ugh, no wonder we're getting crashes. This whole business seems bogus as
hell.

RestoreArchivedFile():
...
/*
* Check signals before restore command and reset afterwards.
*/
PreRestoreCommand();

/*
* Copy xlog from archival storage to XLOGDIR
*/
ret = shell_restore(xlogfname, xlogpath, lastRestartPointFname);

PostRestoreCommand();

/* SIGTERM: set flag to abort redo and exit */
static void
StartupProcShutdownHandler(SIGNAL_ARGS)
{
int save_errno = errno;

if (in_restore_command)
proc_exit(1);
else
shutdown_requested = true;
WakeupRecovery();

errno = save_errno;
}

Where PreRestoreCommand()/PostRestoreCommand() set in_restore_command.

There's *a lot* of stuff happening inside shell_restore() that's not
compatible with doing proc_exit() inside a signal handler. We're
allocating memory! Interact with stdout.

There's also the fact that system() isn't signal safe, but that's a much
less likely problematic issue.

This appears to have gotten worse over a sequence of commits. The
following commits each added something betwen PreRestoreCommand() and
PostRestoreCommand().

commit 1b06d7bac901e5fd20bba597188bae2882bf954b
Author: Fujii Masao <fujii(at)postgresql(dot)org>
Date: 2021-11-22 10:28:21 +0900

Report wait events for local shell commands like archive_command.

added pgstat_report_wait_start/end. Unlikely to cause big issues, but
not good.

commit 7fed801135bae14d63b11ee4a10f6083767046d8
Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date: 2022-08-29 13:55:38 -0400

Clean up inconsistent use of fflush().

Made it a bit worse by adding an fflush(). That certainly seems like it
could cause hangs.

commit 9a740f81eb02e04179d78f3df2ce671276c27b07
Author: Michael Paquier <michael(at)paquier(dot)xyz>
Date: 2023-01-16 16:31:43 +0900

Refactor code in charge of running shell-based recovery commands

which completely broke the mechanism. We suddenly run the entirety of
shell_restore(), which does pallocs etc to build the string passed to
system, and raises errors, all within a section in which a signal
handler can invoke proc_exit(). That's just completely broken.

Sorry, but particularly in this area, you got to be a heck of a lot more
careful.

I don't see a choice but to revert the recent changes. They need a
fairly large rewrite.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2023-02-01 11:21:20 Re: heapgettup refactoring
Previous Message Dagfinn Ilmari Mannsåker 2023-02-01 10:50:42 Re: Clarify deleting comments and security labels in synopsis