Re: O(n) tasks cause lengthy startups and checkpoints

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Maxim Orlov <orlovmg(at)gmail(dot)com>, Amul Sul <sulamul(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: O(n) tasks cause lengthy startups and checkpoints
Date: 2022-07-03 17:07:54
Message-ID: 20220703170754.GA1048244@nathanxps13
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Andres,

Thanks for the prompt review.

On Sat, Jul 02, 2022 at 03:54:56PM -0700, Andres Freund wrote:
> On 2022-07-02 15:05:54 -0700, Nathan Bossart wrote:
>> + /* Obtain requested tasks */
>> + SpinLockAcquire(&CustodianShmem->cust_lck);
>> + flags = CustodianShmem->cust_flags;
>> + CustodianShmem->cust_flags = 0;
>> + SpinLockRelease(&CustodianShmem->cust_lck);
>
> Just resetting the flags to 0 is problematic. Consider what happens if there's
> two tasks and and the one processed first errors out. You'll loose information
> about needing to run the second task.

I think we also want to retry any failed tasks. The way v6 handles this is
by requesting all tasks after an exception. Another way to handle this
could be to reset each individual flag before the task is executed, and
then we could surround each one with a PG_CATCH block that resets the flag.
I'll do it this way in the next revision.

>> +/*
>> + * RequestCustodian
>> + * Called to request a custodian task.
>> + */
>> +void
>> +RequestCustodian(int flags)
>> +{
>> + SpinLockAcquire(&CustodianShmem->cust_lck);
>> + CustodianShmem->cust_flags |= flags;
>> + SpinLockRelease(&CustodianShmem->cust_lck);
>> +
>> + if (ProcGlobal->custodianLatch)
>> + SetLatch(ProcGlobal->custodianLatch);
>> +}
>
> With this representation we can't really implement waiting for a task or
> such. And it doesn't seem like a great API for the caller to just specify a
> mix of flags.

At the moment, the idea is that nothing should need to wait for a task
because the custodian only handles things that are relatively non-critical.
If that changes, this could probably be expanded to look more like
RequestCheckpoint().

What would you suggest using instead of a mix of flags?

>> + /* Calculate how long to sleep */
>> + end_time = (pg_time_t) time(NULL);
>> + elapsed_secs = end_time - start_time;
>> + if (elapsed_secs >= CUSTODIAN_TIMEOUT_S)
>> + continue; /* no sleep for us */
>> + cur_timeout = CUSTODIAN_TIMEOUT_S - elapsed_secs;
>> +
>> + (void) WaitLatch(MyLatch,
>> + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
>> + cur_timeout * 1000L /* convert to ms */ ,
>> + WAIT_EVENT_CUSTODIAN_MAIN);
>> + }
>
> I don't think we should have this thing wake up on a regular basis. We're
> doing way too much of that already, and I don't think we should add
> more. Either we need a list of times when tasks need to be processed and wake
> up at that time, or just wake up if somebody requests a task.

I agree. I will remove the timeout in the next revision.

>> diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
>> index e67370012f..82aa0c6307 100644
>> --- a/src/backend/postmaster/postmaster.c
>> +++ b/src/backend/postmaster/postmaster.c
>> @@ -1402,7 +1402,8 @@ PostmasterMain(int argc, char *argv[])
>> * Remove old temporary files. At this point there can be no other
>> * Postgres processes running in this directory, so this should be safe.
>> */
>> - RemovePgTempFiles();
>> + RemovePgTempFiles(true, true);
>> + RemovePgTempFiles(false, false);
>
> This is imo hard to read and easy to get wrong. Make it multiple functions or
> pass named flags in.

Will do.

>> + * StagePgTempDirForRemoval
>> + *
>> + * This function renames the given directory with a special prefix that
>> + * RemoveStagedPgTempDirs() will know to look for. An integer is appended to
>> + * the end of the new directory name in case previously staged pgsql_tmp
>> + * directories have not yet been removed.
>> + */
>
> It doesn't seem great to need to iterate through a directory that contains
> other files, potentially a significant number. How about having a
> staged_for_removal/ directory, and then only scanning that?

Yeah, that seems like a good idea. Will do.

>> + /*
>> + * Find a name for the stage directory. We just increment an integer at the
>> + * end of the name until we find one that doesn't exist.
>> + */
>> + for (int n = 0; n <= INT_MAX; n++)
>> + {
>> + snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
>> + PG_TEMP_DIR_TO_REMOVE_PREFIX, n);
>> +
>> + if (stat(stage_path, &statbuf) != 0)
>> + {
>> + if (errno == ENOENT)
>> + break;
>> +
>> + ereport(LOG,
>> + (errcode_for_file_access(),
>> + errmsg("could not stat file \"%s\": %m", stage_path)));
>> + return;
>> + }
>> +
>> + stage_path[0] = '\0';
>
> I still dislike this approach. Loops until INT_MAX, not interruptible... Can't
> we prevent conflicts by adding a timestamp or such?

I suppose it's highly unlikely that we'd see a conflict if we used the
timestamp instead. I'll do it this way in the next revision if that seems
good enough.

>> From a58a6bb70785a557a150680b64cd8ce78ce1b73a Mon Sep 17 00:00:00 2001
>> From: Nathan Bossart <bossartn(at)amazon(dot)com>
>> Date: Sun, 5 Dec 2021 22:02:40 -0800
>> Subject: [PATCH v6 5/6] Move removal of old serialized snapshots to custodian.
>>
>> This was only done during checkpoints because it was a convenient
>> place to put it.
>
> As mentioned before, having it done as part of checkpoints provides pretty
> decent wraparound protection - yes, it's not theoretically perfect, but in
> reality it's very unlikely you can have an xid wraparound within one
> checkpoint. I've mentioned this before, so at the very least I'd like to see
> this acknowledged in the commit message.

Will do.

>> + /* let the custodian know what it can remove */
>> + CustodianSetLogicalRewriteCutoff(cutoff);
>
> Setting this variable in a custodian datastructure and then fetching it from
> there seems architecturally wrong to me.

Where do you think it should go? I previously had it in the checkpointer's
shared memory, but you didn't like that the functions were declared in
bgwriter.h (along with the other checkpoint stuff). If the checkpointer
shared memory is the right place, should we create checkpointer.h and use
that instead?

>> + RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS);
>
> What about single user mode?
>
>
> ISTM that RequestCustodian() needs to either assert out if called in single
> user mode, or execute tasks immediately in that context.

I like the idea of executing the tasks immediately since that's what
happens today in single-user mode. I will try doing it that way.

>> +/*
>> + * Remove all mappings not needed anymore based on the logical restart LSN saved
>> + * by the checkpointer. We use this saved value instead of calling
>> + * ReplicationSlotsComputeLogicalRestartLSN() so that we don't interfere with an
>> + * ongoing call to CheckPointLogicalRewriteHeap() that is flushing mappings to
>> + * disk.
>> + */
>
> What interference could there be?

My concern is that the custodian could obtain a later cutoff than what the
checkpointer does, which might cause files to be concurrently unlinked and
fsync'd. If we always use the checkpointer's cutoff, that shouldn't be a
problem. This could probably be better explained in this comment.

>> +void
>> +RemoveOldLogicalRewriteMappings(void)
>> +{
>> + XLogRecPtr cutoff;
>> + DIR *mappings_dir;
>> + struct dirent *mapping_de;
>> + char path[MAXPGPATH + 20];
>> + bool value_set = false;
>> +
>> + cutoff = CustodianGetLogicalRewriteCutoff(&value_set);
>> + if (!value_set)
>> + return;
>
> Afaics nothing clears values_set - is that a good idea?

I'm using value_set to differentiate the case where InvalidXLogRecPtr means
the checkpointer hasn't determined a value yet versus the case where it
has. In the former, we don't want to take any action. In the latter, we
want to unlink all the files. Since we're moving to a request model for
the custodian, I might be able to remove this value_set stuff completely.
If that's not possible, it probably deserves a better comment.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-07-03 17:08:36 Re: SQL/JSON functions vs. ECPG vs. STRING as a reserved word
Previous Message vignesh C 2022-07-03 14:59:37 Re: Handle infinite recursion in logical replication setup