Re: O(n) tasks cause lengthy startups and checkpoints

From: Andres Freund <andres(at)anarazel(dot)de>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Maxim Orlov <orlovmg(at)gmail(dot)com>, Amul Sul <sulamul(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: O(n) tasks cause lengthy startups and checkpoints
Date: 2022-07-02 22:54:56
Message-ID: 20220702225456.zit5kjdtdfqmjujt@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-07-02 15:05:54 -0700, Nathan Bossart wrote:
> + /* Obtain requested tasks */
> + SpinLockAcquire(&CustodianShmem->cust_lck);
> + flags = CustodianShmem->cust_flags;
> + CustodianShmem->cust_flags = 0;
> + SpinLockRelease(&CustodianShmem->cust_lck);

Just resetting the flags to 0 is problematic. Consider what happens if there's
two tasks and and the one processed first errors out. You'll loose information
about needing to run the second task.

> + /* TODO: offloaded tasks go here */

Seems we're going to need some sorting of which tasks are most "urgent" / need
to be processed next if we plan to make this into some generic facility.

> +/*
> + * RequestCustodian
> + * Called to request a custodian task.
> + */
> +void
> +RequestCustodian(int flags)
> +{
> + SpinLockAcquire(&CustodianShmem->cust_lck);
> + CustodianShmem->cust_flags |= flags;
> + SpinLockRelease(&CustodianShmem->cust_lck);
> +
> + if (ProcGlobal->custodianLatch)
> + SetLatch(ProcGlobal->custodianLatch);
> +}

With this representation we can't really implement waiting for a task or
such. And it doesn't seem like a great API for the caller to just specify a
mix of flags.

> + /* Calculate how long to sleep */
> + end_time = (pg_time_t) time(NULL);
> + elapsed_secs = end_time - start_time;
> + if (elapsed_secs >= CUSTODIAN_TIMEOUT_S)
> + continue; /* no sleep for us */
> + cur_timeout = CUSTODIAN_TIMEOUT_S - elapsed_secs;
> +
> + (void) WaitLatch(MyLatch,
> + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> + cur_timeout * 1000L /* convert to ms */ ,
> + WAIT_EVENT_CUSTODIAN_MAIN);
> + }

I don't think we should have this thing wake up on a regular basis. We're
doing way too much of that already, and I don't think we should add
more. Either we need a list of times when tasks need to be processed and wake
up at that time, or just wake up if somebody requests a task.

> From 5e95666efa31d6c8aa351e430c37ead6e27acb72 Mon Sep 17 00:00:00 2001
> From: Nathan Bossart <bossartn(at)amazon(dot)com>
> Date: Sun, 5 Dec 2021 21:16:44 -0800
> Subject: [PATCH v6 3/6] Split pgsql_tmp cleanup into two stages.
>
> First, pgsql_tmp directories will be renamed to stage them for
> removal. Then, all files in pgsql_tmp are removed before removing
> the staged directories themselves. This change is being made in
> preparation for a follow-up change to offload most temporary file
> cleanup to the new custodian process.

> Note that temporary relation files cannot be cleaned up via the
> aforementioned strategy and will not be offloaded to the custodian.

> ---
> src/backend/postmaster/postmaster.c | 8 +-
> src/backend/storage/file/fd.c | 174 ++++++++++++++++++++++++----
> src/include/storage/fd.h | 2 +-
> 3 files changed, 160 insertions(+), 24 deletions(-)
>
> diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
> index e67370012f..82aa0c6307 100644
> --- a/src/backend/postmaster/postmaster.c
> +++ b/src/backend/postmaster/postmaster.c
> @@ -1402,7 +1402,8 @@ PostmasterMain(int argc, char *argv[])
> * Remove old temporary files. At this point there can be no other
> * Postgres processes running in this directory, so this should be safe.
> */
> - RemovePgTempFiles();
> + RemovePgTempFiles(true, true);
> + RemovePgTempFiles(false, false);

This is imo hard to read and easy to get wrong. Make it multiple functions or
pass named flags in.

> + * StagePgTempDirForRemoval
> + *
> + * This function renames the given directory with a special prefix that
> + * RemoveStagedPgTempDirs() will know to look for. An integer is appended to
> + * the end of the new directory name in case previously staged pgsql_tmp
> + * directories have not yet been removed.
> + */

It doesn't seem great to need to iterate through a directory that contains
other files, potentially a significant number. How about having a
staged_for_removal/ directory, and then only scanning that?

> +static void
> +StagePgTempDirForRemoval(const char *tmp_dir)
> +{
> + DIR *dir;
> + char stage_path[MAXPGPATH * 2];
> + char parent_path[MAXPGPATH * 2];
> + struct stat statbuf;
> +
> + /*
> + * If tmp_dir doesn't exist, there is nothing to stage.
> + */
> + dir = AllocateDir(tmp_dir);
> + if (dir == NULL)
> + {
> + if (errno != ENOENT)
> + ereport(LOG,
> + (errcode_for_file_access(),
> + errmsg("could not open directory \"%s\": %m", tmp_dir)));
> + return;
> + }
> + FreeDir(dir);
> +
> + strlcpy(parent_path, tmp_dir, MAXPGPATH * 2);
> + get_parent_directory(parent_path);
> +
> + /*
> + * get_parent_directory() returns an empty string if the input argument is
> + * just a file name (see comments in path.c), so handle that as being the
> + * current directory.
> + */
> + if (strlen(parent_path) == 0)
> + strlcpy(parent_path, ".", MAXPGPATH * 2);
> +
> + /*
> + * Find a name for the stage directory. We just increment an integer at the
> + * end of the name until we find one that doesn't exist.
> + */
> + for (int n = 0; n <= INT_MAX; n++)
> + {
> + snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
> + PG_TEMP_DIR_TO_REMOVE_PREFIX, n);
> +
> + if (stat(stage_path, &statbuf) != 0)
> + {
> + if (errno == ENOENT)
> + break;
> +
> + ereport(LOG,
> + (errcode_for_file_access(),
> + errmsg("could not stat file \"%s\": %m", stage_path)));
> + return;
> + }
> +
> + stage_path[0] = '\0';

I still dislike this approach. Loops until INT_MAX, not interruptible... Can't
we prevent conflicts by adding a timestamp or such?

> + }
> +
> + /*
> + * In the unlikely event that we couldn't find a name for the stage
> + * directory, bail out.
> + */
> + if (stage_path[0] == '\0')
> + {
> + ereport(LOG,
> + (errmsg("could not stage \"%s\" for deletion",
> + tmp_dir)));
> + return;
> + }

That's imo very much not ok. Just continuing in unexpected situations is a
recipe for introducing bugs / being hard to debug.

> From 43042799b96b588a446c509637b5acf570e2a325 Mon Sep 17 00:00:00 2001

> From a58a6bb70785a557a150680b64cd8ce78ce1b73a Mon Sep 17 00:00:00 2001
> From: Nathan Bossart <bossartn(at)amazon(dot)com>
> Date: Sun, 5 Dec 2021 22:02:40 -0800
> Subject: [PATCH v6 5/6] Move removal of old serialized snapshots to custodian.
>
> This was only done during checkpoints because it was a convenient
> place to put it.

As mentioned before, having it done as part of checkpoints provides pretty
decent wraparound protection - yes, it's not theoretically perfect, but in
reality it's very unlikely you can have an xid wraparound within one
checkpoint. I've mentioned this before, so at the very least I'd like to see
this acknowledged in the commit message.

> However, if there are many snapshots to remove, it can significantly extend
> checkpoint time.

I'd really like to see a reproducer or profile for this...

> + /*
> + * Remove serialized snapshots that are no longer required by any
> + * logical replication slot.
> + *
> + * It is not important for these to be removed in single-user mode, so
> + * we don't need any extra handling outside of the custodian process for
> + * this.
> + */

I don't think this claim is correct.

> From 0add8bb19a4ee83c6a6ec1f313329d737bf304a5 Mon Sep 17 00:00:00 2001
> From: Nathan Bossart <bossartn(at)amazon(dot)com>
> Date: Sun, 12 Dec 2021 22:07:11 -0800
> Subject: [PATCH v6 6/6] Move removal of old logical rewrite mapping files to
> custodian.
>
> If there are many such files to remove, checkpoints can take much
> longer. To avoid this, move this work to the newly-introduced
> custodian process.

As above I'd like to know why this could take that long. What are you doing
that there's so many mapping files (which only exist for catalog tables!) that
this is a significant fraction of a checkpoint?

> ---
> src/backend/access/heap/rewriteheap.c | 79 +++++++++++++++++++++++----
> src/backend/postmaster/custodian.c | 44 +++++++++++++++
> src/include/access/rewriteheap.h | 1 +
> src/include/postmaster/custodian.h | 5 ++
> 4 files changed, 119 insertions(+), 10 deletions(-)
>
> diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
> index 2a53826736..edeab65e60 100644
> --- a/src/backend/access/heap/rewriteheap.c
> +++ b/src/backend/access/heap/rewriteheap.c
> @@ -116,6 +116,7 @@
> #include "lib/ilist.h"
> #include "miscadmin.h"
> #include "pgstat.h"
> +#include "postmaster/custodian.h"
> #include "replication/logical.h"
> #include "replication/slot.h"
> #include "storage/bufmgr.h"
> @@ -1182,7 +1183,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
> * Perform a checkpoint for logical rewrite mappings
> *
> * This serves two tasks:
> - * 1) Remove all mappings not needed anymore based on the logical restart LSN
> + * 1) Alert the custodian to remove all mappings not needed anymore based on the
> + * logical restart LSN
> * 2) Flush all remaining mappings to disk, so that replay after a checkpoint
> * only has to deal with the parts of a mapping that have been written out
> * after the checkpoint started.
> @@ -1210,6 +1212,10 @@ CheckPointLogicalRewriteHeap(void)
> if (cutoff != InvalidXLogRecPtr && redo < cutoff)
> cutoff = redo;
>
> + /* let the custodian know what it can remove */
> + CustodianSetLogicalRewriteCutoff(cutoff);

Setting this variable in a custodian datastructure and then fetching it from
there seems architecturally wrong to me.

> + RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS);

What about single user mode?

ISTM that RequestCustodian() needs to either assert out if called in single
user mode, or execute tasks immediately in that context.

> +
> +/*
> + * Remove all mappings not needed anymore based on the logical restart LSN saved
> + * by the checkpointer. We use this saved value instead of calling
> + * ReplicationSlotsComputeLogicalRestartLSN() so that we don't interfere with an
> + * ongoing call to CheckPointLogicalRewriteHeap() that is flushing mappings to
> + * disk.
> + */

What interference could there be?

> +void
> +RemoveOldLogicalRewriteMappings(void)
> +{
> + XLogRecPtr cutoff;
> + DIR *mappings_dir;
> + struct dirent *mapping_de;
> + char path[MAXPGPATH + 20];
> + bool value_set = false;
> +
> + cutoff = CustodianGetLogicalRewriteCutoff(&value_set);
> + if (!value_set)
> + return;

Afaics nothing clears values_set - is that a good idea?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2022-07-03 00:13:41 Re: First draft of the PG 15 release notes
Previous Message Nathan Bossart 2022-07-02 22:16:35 Re: replacing role-level NOINHERIT with a grant-level option