Re: recovery modules

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: recovery modules
Date: 2023-01-23 02:44:57
Message-ID: Y830qTRZJP6XGuOF@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 17, 2023 at 08:44:27PM -0800, Nathan Bossart wrote:
> Thanks. Here's a rebased version of the last patch.

Thanks for the rebase.

The final state of the documentation is as follows:
51. Archive and Recovery Modules
51.1. Archive Module Initialization Functions
51.2. Archive Module Callbacks
51.3. Recovery Module Initialization Functions
51.4. Recovery Module Callbacks

I am not completely sure whether this grouping is the best thing to
do. Wouldn't it be better to move that into two different
sub-sections instead? One layout suggestion:
51. WAL Modules
51.1. Archive Modules
51.1.1. Archive Module Initialization Functions
51.1.2. Archive Module Callbacks
51.2. Recovery Modules
51.2.1. Recovery Module Initialization Functions
51.2.2. Recovery Module Callbacks

Putting both of them in the same section sounds like a good idea per
the symmetry that one would likely have between the code paths of
archiving and recovery, so as they share some common knowledge.

This kinds of comes back to the previous suggestion to rename
basic_archive to something like wal_modules, that covers both
archiving and recovery. I does not seem like this would overlap with
RMGRs, which is much more internal anyway.

(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("must specify restore_command when standby mode is not enabled")));
+ errmsg("must specify restore_command or a restore_library that defines "
+ "a restore callback when standby mode is not enabled")));
This is long. Shouldn't this be split into an errdetail() to list all
the options at hand?

- if (XLogArchiveLibrary[0] != '\0' && XLogArchiveCommand[0] != '\0')
- ereport(ERROR,
- (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("both archive_command and archive_library set"),
- errdetail("Only one of archive_command, archive_library may be set.")));
+ CheckMutuallyExclusiveGUCs(XLogArchiveLibrary, "archive_library",
+ XLogArchiveCommand, "archive_command");

The introduction of this routine could be a patch on its own, as it
impacts the archiving path.

+ CheckMutuallyExclusiveGUCs(restoreLibrary, "restore_library",
+ archiveCleanupCommand, "archive_cleanup_command");
+ if (strcmp(prevRestoreLibrary, restoreLibrary) != 0 ||
+ strcmp(prevArchiveCleanupCommand, archiveCleanupCommand) != 0)
+ {
+ call_recovery_module_shutdown_cb(0, (Datum) 0);
+ LoadRecoveryCallbacks();
+ }
+
+ pfree(prevRestoreLibrary);
+ pfree(prevArchiveCleanupCommand);

Hm.. The callers of CheckMutuallyExclusiveGUCs() with the new ERROR
paths they introduce need a close lookup. As far as I can see this
concerns four areas depending on the three restore commands
(restore_command and recovery_end command for startup,
archive_cleanup_command for the checkpointer):
- Startup process initialization, as of validateRecoveryParameters()
where the postmaster GUCs for the recovery target are evaluated. This
one is an early stage which is fine.
- Startup reloading, as of StartupRereadConfig(). This code could
involve a WAL receiver restart depending on a change in the slot
change or in primary_conninfo, and force at the same time an ERROR
because of conflicting recovery library and command configuration.
This one should be safe because pendingWalRcvRestart would just be
considered later on by the startup process itself while waiting for
WAL to become available. Still this could deserve a comment? Even if
there is a misconfiguration, a reload on a standby would enforce a
FATAL in the startup process, taking down the whole server.
- Checkpointer initialization, as of CheckpointerMain(). A
configuration failure in this code path, aka server startup, causes
the server to loop infinitely on FATAL with the misconfiguration
showing up all the time.. This is a problem.
- Last comes the checkpointer GUC reloading, as of
HandleCheckpointerInterrupts(), with a second problem. This
introduces a failure path where ConfigReloadPending is processed at
the same time as ShutdownRequestPending based on the way it is coded,
interacting with what would be a normal shutdown in some cases? And
actually, if you enforce a misconfiguration on reload, the
checkpointer reports an error but it does not enforce a process
restart, hence it keeps around the new, incorrect, configuration while
waiting for a new checkpoint to happen once restore_library and
archive_cleanup_command are set. This could lead to surprises, IMO.
Upgrading to a FATAL in this code path triggers an infinite loop, like
the startup path.

If the archive_cleanup_command ballback of a restore library triggers
a FATAL, it seems to me that it would continuously trigger a server
restart, actually. Perhaps that's something to document, in
comparison to the safe fallbacks of the shell command where we don't
force an ERROR to give priority to the stability of the checkpointer.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Isaac Morland 2023-01-23 02:50:29 Re: Remove source code display from \df+?
Previous Message Justin Pryzby 2023-01-23 02:37:32 Re: Remove source code display from \df+?