Re: Weird failure with latches in curculio on v15

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Fujii Masao <fujii(at)postgresql(dot)org>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Weird failure with latches in curculio on v15
Date: 2023-02-02 22:01:13
Message-ID: 20230202220113.GA3945808@nathanxps13
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 02, 2023 at 04:14:54PM -0500, Robert Haas wrote:
> + /*
> + * When exitOnSigterm is set and we are in the startup process, use the
> + * special wrapper for system() that enables exiting immediately upon
> + * receiving SIGTERM. This ensures we can break out of system() if
> + * required.
> + */
>
> This comment, for me, raises more questions than it answers. Why do we
> only do this in the startup process?

Currently, this functionality only exists in the startup process because it
is only used for restore_command. More below...

> Also, and this part is not the fault of this patch but a defect of the
> pre-existing comments, under what circumstances do we not want to exit
> when we get a SIGTERM? It's standard behavior for PostgreSQL backends
> to exit when they receive SIGTERM, so the question isn't why we
> sometimes exit immediately but why we ever don't. The existing code
> calls ExecuteRecoveryCommand with exitOnSigterm true in some cases and
> false in other cases, and AFAICS there are zero words of comments
> explaining the reasoning.

I've been digging into the history here. This e-mail seems to have the
most context [0]. IIUC this was intended to prevent "fast" shutdowns from
escalating to "immediate" shutdowns because the restore command died
unexpectedly. This doesn't apply to archive_cleanup_command because we
don't FATAL if it dies unexpectedly. It seems like this idea should apply
to recovery_end_command, too, but AFAICT it doesn't use the same approach.
My guess is that this hasn't come up because it's less likely that both 1)
recovery_end_command is used and 2) someone initiates shutdown while it is
running.

BTW the relevant commits are cdd46c7 (added SIGTERM handling for
restore_command), 9e403c2 (added recovery_end_command), and c21ac0b (added
what is today called archive_cleanup_command).

> + if (exitOnSigterm && MyBackendType == B_STARTUP)
> + rc = RunInterruptibleShellCommand(command);
> + else
> + rc = system(command);
>
> And this looks like pure magic. I'm all in favor of not relying on
> system(), but using it under some opaque set of conditions and
> otherwise doing something else is not the way. At the very least this
> needs to be explained a whole lot better.

If we applied this exit-on-SIGTERM behavior to recovery_end_command, I
think we could combine failOnSignal and exitOnSigterm into one flag, and
then it might be a little easier to explain what is going on. In any case,
I agree that this deserves a lengthy explanation, which I'll continue to
work on.

[0] https://postgr.es/m/499047FE.9090407%40enterprisedb.com

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2023-02-02 22:14:40 Re: Move defaults toward ICU in 16?
Previous Message Peter Eisentraut 2023-02-02 21:58:32 Remove unused code related to unknown type