Re: O(n) tasks cause lengthy startups and checkpoints

From: Andres Freund <andres(at)anarazel(dot)de>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Maxim Orlov <orlovmg(at)gmail(dot)com>, Amul Sul <sulamul(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: O(n) tasks cause lengthy startups and checkpoints
Date: 2022-02-17 01:50:52
Message-ID: 20220217015052.2y3wxibeommk37ey@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-02-16 16:50:57 -0800, Nathan Bossart wrote:
> + * The custodian process is new as of Postgres 15.

I think this kind of comment tends to age badly and not be very useful.

> It's main purpose is to
> + * offload tasks that could otherwise delay startup and checkpointing, but
> + * it needn't be restricted to just those things. Offloaded tasks should
> + * not be synchronous (e.g., checkpointing shouldn't need to wait for the
> + * custodian to complete a task before proceeding). Also, ensure that any
> + * offloaded tasks are either not required during single-user mode or are
> + * performed separately during single-user mode.
> + *
> + * The custodian is not an essential process and can shutdown quickly when
> + * requested. The custodian will wake up approximately once every 5
> + * minutes to perform its tasks, but backends can (and should) set its
> + * latch to wake it up sooner.

Hm. This kind policy makes it easy to introduce bugs where the occasional runs
mask forgotten notifications etc.

> + * Normal termination is by SIGTERM, which instructs the bgwriter to
> + * exit(0).

s/bgwriter/.../

> Emergency termination is by SIGQUIT; like any backend, the
> + * custodian will simply abort and exit on SIGQUIT.
> + *
> + * If the custodian exits unexpectedly, the postmaster treats that the same
> + * as a backend crash: shared memory may be corrupted, so remaining
> + * backends should be killed by SIGQUIT and then a recovery cycle started.

This doesn't really seem useful stuff to me.

> + /*
> + * If an exception is encountered, processing resumes here.
> + *
> + * You might wonder why this isn't coded as an infinite loop around a
> + * PG_TRY construct. The reason is that this is the bottom of the
> + * exception stack, and so with PG_TRY there would be no exception handler
> + * in force at all during the CATCH part. By leaving the outermost setjmp
> + * always active, we have at least some chance of recovering from an error
> + * during error recovery. (If we get into an infinite loop thereby, it
> + * will soon be stopped by overflow of elog.c's internal state stack.)
> + *
> + * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask
> + * (to wit, BlockSig) will be restored when longjmp'ing to here. Thus,
> + * signals other than SIGQUIT will be blocked until we complete error
> + * recovery. It might seem that this policy makes the HOLD_INTERRUPS()
> + * call redundant, but it is not since InterruptPending might be set
> + * already.
> + */

I think it's bad to copy this comment into even more places.

> + /* Since not using PG_TRY, must reset error stack by hand */
> + if (sigsetjmp(local_sigjmp_buf, 1) != 0)
> + {

I also think it's a bad idea to introduce even more copies of the error
handling body. I think we need to unify this. And yes, it's unfair to stick
you with it, but it's been a while since a new aux process has been added.

> + /*
> + * These operations are really just a minimal subset of
> + * AbortTransaction(). We don't have very many resources to worry
> + * about.
> + */

Given what you're proposing this for, are you actually confident that we don't
need more than this?

> From d9826f75ad2259984d55fc04622f0b91ebbba65a Mon Sep 17 00:00:00 2001
> From: Nathan Bossart <bossartn(at)amazon(dot)com>
> Date: Sun, 5 Dec 2021 19:38:20 -0800
> Subject: [PATCH v5 2/8] Also remove pgsql_tmp directories during startup.
>
> Presently, the server only removes the contents of the temporary
> directories during startup, not the directory itself. This changes
> that to prepare for future commits that will move temporary file
> cleanup to a separate auxiliary process.

Is this actually safe? Is there a guarantee no process can access a temp table
stored in one of these? Because without WAL guaranteeing consistency, we can't
just access e.g. temp tables written before a crash.

> +extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
> + bool unlink_all);

I don't like functions with multiple consecutive booleans, they tend to get
swapped around. Why not just split unlink_all=true/false into different
functions?

> Subject: [PATCH v5 3/8] Split pgsql_tmp cleanup into two stages.
>
> First, pgsql_tmp directories will be renamed to stage them for
> removal.

What if the target name already exists?

> Then, all files in pgsql_tmp are removed before removing
> the staged directories themselves. This change is being made in
> preparation for a follow-up change to offload most temporary file
> cleanup to the new custodian process.
>
> Note that temporary relation files cannot be cleaned up via the
> aforementioned strategy and will not be offloaded to the custodian.

This should be in the prior commit message, otherwise people will ask the same
question as I did.

> + /*
> + * Find a name for the stage directory. We just increment an integer at the
> + * end of the name until we find one that doesn't exist.
> + */
> + for (int n = 0; n <= INT_MAX; n++)
> + {
> + snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
> + PG_TEMP_DIR_TO_REMOVE_PREFIX, n);

Uninterruptible loops up to INT_MAX do not seem like a good idea.

> + dir = AllocateDir(stage_path);
> + if (dir == NULL)
> + {

Why not just use stat()? That's cheaper, and there's no
time-to-check-time-to-use issue here, we're the only one writing.

> + if (errno == ENOENT)
> + break;
> +
> + ereport(LOG,
> + (errcode_for_file_access(),
> + errmsg("could not open directory \"%s\": %m",
> + stage_path)));

I think this kind of lenience is just hiding bugs.

> File
> PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
> @@ -3175,7 +3178,8 @@ RemovePgTempFiles(bool stage, bool remove_relation_files)
> */
> spc_dir = AllocateDir("pg_tblspc");
>
> - while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
> + while (!ShutdownRequestPending &&
> + (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)

Uh, huh? It strikes me as a supremely bad idea to have functions *silently*
not do their jobs when ShutdownRequestPending is set, particularly without a
huge fat comment.

> {
> if (strcmp(spc_de->d_name, ".") == 0 ||
> strcmp(spc_de->d_name, "..") == 0)
> @@ -3211,6 +3215,14 @@ RemovePgTempFiles(bool stage, bool remove_relation_files)
> * would create a race condition. It's done separately, earlier in
> * postmaster startup.
> */
> +
> + /*
> + * If we just staged some pgsql_tmp directories for removal, wake up the
> + * custodian process so that it deletes all the files in the staged
> + * directories as well as the directories themselves.
> + */
> + if (stage && ProcGlobal->custodianLatch)
> + SetLatch(ProcGlobal->custodianLatch);

Just signalling without letting the custodian know what it's expected to do
strikes me as a bad idea.

> From 9c2013d53cc5c857ef8aca3df044613e66215aee Mon Sep 17 00:00:00 2001
> From: Nathan Bossart <bossartn(at)amazon(dot)com>
> Date: Sun, 5 Dec 2021 22:02:40 -0800
> Subject: [PATCH v5 5/8] Move removal of old serialized snapshots to custodian.
>
> This was only done during checkpoints because it was a convenient
> place to put it. However, if there are many snapshots to remove,
> it can significantly extend checkpoint time. To avoid this, move
> this work to the newly-introduced custodian process.
> ---
> src/backend/access/transam/xlog.c | 2 --
> src/backend/postmaster/custodian.c | 11 +++++++++++
> src/backend/replication/logical/snapbuild.c | 13 +++++++------
> src/include/replication/snapbuild.h | 2 +-
> 4 files changed, 19 insertions(+), 9 deletions(-)

Why does this not open us up to new xid wraparound issues? Before there was a
hard bound on how long these files could linger around. Now there's not
anymore.

> - while ((snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL)
> + while (!ShutdownRequestPending &&
> + (snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL)

I really really strenuously object to these checks.

> Subject: [PATCH v5 6/8] Move removal of old logical rewrite mapping files to
> custodian.

> If there are many such files to remove, checkpoints can take much
> longer. To avoid this, move this work to the newly-introduced
> custodian process.

Same wraparound concerns.

> +#include "postmaster/bgwriter.h"

I think it's a bad idea to put these functions into bgwriter.h

> From cfca62dd55d7be7e0025e5625f18d3ab9180029c Mon Sep 17 00:00:00 2001
> From: Nathan Bossart <bossartn(at)amazon(dot)com>
> Date: Mon, 13 Dec 2021 20:20:12 -0800
> Subject: [PATCH v5 7/8] Use syncfs() in CheckPointLogicalRewriteHeap() for
> shutdown and end-of-recovery checkpoints.
>
> This may save quite a bit of time when there are many mapping files
> to flush to disk.

Seems like an a mostly independent proposal.

> +#ifdef HAVE_SYNCFS
> +
> + /*
> + * If we are doing a shutdown or end-of-recovery checkpoint, let's use
> + * syncfs() to flush the mappings to disk instead of flushing each one
> + * individually. This may save us quite a bit of time when there are many
> + * such files to flush.
> + */

I am doubtful this is a good idea. This will cause all dirty files to be
written back, even ones we don't need to be written back. At once. Very
possibly *slowing down* the shutdown.

What is even the theory of the case here? That there's so many dirty mapping
files that fsyncing them will take too long? That iterating would take too
long?

> From b5923b1b76a1fab6c21d6aec086219160473f464 Mon Sep 17 00:00:00 2001
> From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
> Date: Fri, 11 Feb 2022 09:43:57 -0800
> Subject: [PATCH v5 8/8] Move removal of spilled logical slot data to
> custodian.
>
> If there are many such files, startup can take much longer than
> necessary. To handle this, startup creates a new slot directory,
> copies the state file, and swaps the new directory with the old
> one. The custodian then asynchronously cleans up the old slot
> directory.

You guess it: I don't see what prevents wraparound issues.

> 5 files changed, 317 insertions(+), 9 deletions(-)

This seems such an increase in complexity and fragility that I really doubt
this is a good idea.

> +/*
> + * This function renames the given directory with a special suffix that the
> + * custodian will know to look for. An integer is appended to the end of the
> + * new directory name in case previously staged slot directories have not yet
> + * been removed.
> + */
> +static void
> +StageSlotDirForRemoval(const char *slotname, const char *slotpath)
> +{
> + char stage_path[MAXPGPATH];
> +
> + /*
> + * Find a name for the stage directory. We just increment an integer at the
> + * end of the name until we find one that doesn't exist.
> + */
> + for (int n = 0; n <= INT_MAX; n++)
> + {
> + DIR *dir;
> +
> + sprintf(stage_path, "pg_replslot/%s.to_remove_%d", slotname, n);
> +
> + dir = AllocateDir(stage_path);
> + if (dir == NULL)
> + {
> + if (errno == ENOENT)
> + break;
> +
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not open directory \"%s\": %m",
> + stage_path)));
> + }
> + FreeDir(dir);
> +
> + stage_path[0] = '\0';
> + }

Copy of "find free name" logic.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-02-17 01:53:08 Re: libpq async duplicate error results
Previous Message Michael Paquier 2022-02-17 01:48:13 Re: adding 'zstd' as a compression algorithm