Re: Weird failure with latches in curculio on v15

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Fujii Masao <fujii(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Weird failure with latches in curculio on v15
Date: 2023-02-01 16:58:01
Message-ID: 20230201165801.33ydbxvjdbomjqa7@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2023-02-01 10:12:26 -0500, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
> >> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
> >> handler which is allowed to call that while in_restore_command is
> >> true.
>
> > Ugh, no wonder we're getting crashes. This whole business seems bogus as
> > hell.
>
> Indeed :-(
>
> > I don't see a choice but to revert the recent changes. They need a
> > fairly large rewrite.
>
> 9a740f81e clearly made things a lot worse, but it wasn't great
> before. Can we see a way forward to removing the problem entirely?

Yea, I think we can - we should stop relying on system(). If we instead
run the command properly as a subprocess, we don't need to do bad things
in the signal handler anymore.

> The fundamental issue is that we have no good way to break out
> of system(), and I think the original idea was that
> in_restore_command would be set *only* for the duration of the
> system() call. That's clearly been lost sight of completely,
> but maybe as a stopgap we could try to get back to that.

We could push the functions setting in_restore_command down into
ExecuteRecoveryCommand(). But I don't think that'd end up necessarily
being right either - we'd now use the mechanism in places we previously
didn't (cleanup/end commands).

And there's just plenty other stuff in the 14bdb3f13de 9a740f81eb0 that
doesn't look right:
- We now have two places open-coding what BuildRestoreCommand did

- I'm doubtful that the new shell_* functions are the base for a good
API to abstract restoring files

- the error message for a failed restore command seems to have gotten
worse:
could not restore file \"%s\" from archive: %s"
->
"%s \"%s\": %s", commandName, command

- shell_* imo is not a good namespace for something called from xlog.c,
xlogarchive.c. I realize the intention is that shell_archive.c is
going to be its own "restore module", but for now it imo looks odd

- The comment moved out of RestoreArchivedFile() doesn't seems less
useful at its new location

- explanation of why we use GetOldestRestartPoint() is halfway lost

My name is listed as the first Reviewed-by, but I certainly haven't done
any meaningful review of these patches. I just replied to top-level
email proposing "recovery modules".

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2023-02-01 17:08:24 Re: Weird failure with latches in curculio on v15
Previous Message Pavel Borisov 2023-02-01 16:57:45 Re: About PostgreSQL Core Team