Re: Weird failure with latches in curculio on v15

From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Fujii Masao <fujii(at)postgresql(dot)org>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Weird failure with latches in curculio on v15
Date: 2023-02-04 11:30:29
Message-ID: 20230204113029.xlcqrbxhp6lerrnc@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2023-02-02 11:06:19 +0900, Michael Paquier wrote:
> > - the error message for a failed restore command seems to have gotten
> > worse:
> > could not restore file \"%s\" from archive: %s"
> > ->
> > "%s \"%s\": %s", commandName, command
>
> IMO, we don't lose any context with this method: the command type and
> the command string itself are the bits actually relevant. Perhaps
> something like that would be more intuitive? One idea:
> "could not execute command for %s: %s", commandName, command

We do - you now can't identify the filename that failed without parsing
the command, which obviously isn't easily possible generically, because,
well, it's user-configurable. I like that the command is logged now,
but I think we need the filename be at a predictable position in addition.

> > - shell_* imo is not a good namespace for something called from xlog.c,
> > xlogarchive.c. I realize the intention is that shell_archive.c is
> > going to be its own "restore module", but for now it imo looks odd
>
> shell_restore.c does not sound that bad to me, FWIW. The parallel
> with the archive counterparts is here. My recent history is not that
> good when it comes to naming, based on the feedback I received,
> though.

I don't mind shell_restore.c much - after all, the filename is
namespaced by the directory. However, I do mind function names like
shell_restore(), that could also be about restoring what SHELL is set
to, or whatever. And functions aren't namespaced in C.

> > And there's just plenty other stuff in the 14bdb3f13de 9a740f81eb0 that
> > doesn't look right:
> > - We now have two places open-coding what BuildRestoreCommand did
>
> Yeah, BuildRestoreCommand() was just a small wrapper on top of the new
> percentrepl.c, making it rather irrelevant at this stage, IMO. For
> the two code paths where it was called.

I don't at all agree. Particularly because you didn't even leave a
pointer in each of the places that if you update one, you also need to
update the other. I don't mind the amount of code it adds, I do mind
that it, without any recognizable reason, implements policy in multiple
places.

> > - The comment moved out of RestoreArchivedFile() doesn't seems less
> > useful at its new location
>
> We are talking about that:
> - /*
> - * Remember, we rollforward UNTIL the restore fails so failure here is
> - * just part of the process... that makes it difficult to determine
> - * whether the restore failed because there isn't an archive to restore,
> - * or because the administrator has specified the restore program
> - * incorrectly. We have to assume the former.
> - *
> - * However, if the failure was due to any sort of signal, it's best to
> - * punt and abort recovery. (If we "return false" here, upper levels will
> - * assume that recovery is complete and start up the database!) It's
> - * essential to abort on child SIGINT and SIGQUIT, because per spec
> - * system() ignores SIGINT and SIGQUIT while waiting; if we see one of
> - * those it's a good bet we should have gotten it too.
> - *
> - * On SIGTERM, assume we have received a fast shutdown request, and exit
> - * cleanly. It's pure chance whether we receive the SIGTERM first, or the
> - * child process. If we receive it first, the signal handler will call
> - * proc_exit, otherwise we do it here. If we or the child process received
> - * SIGTERM for any other reason than a fast shutdown request, postmaster
> - * will perform an immediate shutdown when it sees us exiting
> - * unexpectedly.
> - *
> - * We treat hard shell errors such as "command not found" as fatal, too.
> - */
>
> The failure processing is stuck within the way we build and handle the
> command given down to system(), so keeping this in shell_restore.c (or
> whatever name you think would be a better fit) makes sense to me.
> Now, thinking a bit more of this, we could just push the description
> down to ExecuteRecoveryCommand(), that actually does the work,
> adaptinh the comment based on the refactored internals of the
> routine.

Nothing in the shell_restore() comments explains / asserts that it is
only ever to be called in the startup process. And outside of the
startup process none of the above actually makes sense.

That's kind of my problem with these changes. They try to introduce new
abstraction layers, but don't provide real abstraction, because they're
very tightly bound to the way the functions were called before the
refactoring. And none of these restrictions are actually documented.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2023-02-04 11:34:53 Re: run pgindent on a regular basis / scripted manner
Previous Message Amit Kapila 2023-02-04 11:24:10 Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher