Re: O(n) tasks cause lengthy startups and checkpoints

From: Andres Freund <andres(at)anarazel(dot)de>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Maxim Orlov <orlovmg(at)gmail(dot)com>, Amul Sul <sulamul(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: O(n) tasks cause lengthy startups and checkpoints
Date: 2022-07-03 17:27:32
Message-ID: 20220703172732.wembjsb55xl63vuw@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-07-03 10:07:54 -0700, Nathan Bossart wrote:
> Thanks for the prompt review.
>
> On Sat, Jul 02, 2022 at 03:54:56PM -0700, Andres Freund wrote:
> > On 2022-07-02 15:05:54 -0700, Nathan Bossart wrote:
> >> + /* Obtain requested tasks */
> >> + SpinLockAcquire(&CustodianShmem->cust_lck);
> >> + flags = CustodianShmem->cust_flags;
> >> + CustodianShmem->cust_flags = 0;
> >> + SpinLockRelease(&CustodianShmem->cust_lck);
> >
> > Just resetting the flags to 0 is problematic. Consider what happens if there
> > are two tasks and the one processed first errors out. You'll lose information
> > about needing to run the second task.
>
> I think we also want to retry any failed tasks.

I don't think so, at least not if it's just going to retry that task straight
away - then we'll get stuck on that one task forever. If we had the ability to
"queue" it at the end, to be processed after other already dequeued tasks, it'd
be a different story.
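
To illustrate the earlier point (purely a sketch, with made-up names rather
than the patch's actual API): each task's flag should only be cleared once
that task has actually completed, so an error in one task doesn't drop the
requests for the others.

    /* sketch only; CUSTODIAN_TASK_A / DoTaskA() are placeholders */
    SpinLockAcquire(&CustodianShmem->cust_lck);
    flags = CustodianShmem->cust_flags;
    SpinLockRelease(&CustodianShmem->cust_lck);

    if (flags & CUSTODIAN_TASK_A)
    {
        DoTaskA();

        /* clear the bit only after the task finished successfully */
        SpinLockAcquire(&CustodianShmem->cust_lck);
        CustodianShmem->cust_flags &= ~CUSTODIAN_TASK_A;
        SpinLockRelease(&CustodianShmem->cust_lck);
    }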

> The way v6 handles this is by requesting all tasks after an exception.

Ick. That strikes me as a bad idea.

> >> +/*
> >> + * RequestCustodian
> >> + * Called to request a custodian task.
> >> + */
> >> +void
> >> +RequestCustodian(int flags)
> >> +{
> >> + SpinLockAcquire(&CustodianShmem->cust_lck);
> >> + CustodianShmem->cust_flags |= flags;
> >> + SpinLockRelease(&CustodianShmem->cust_lck);
> >> +
> >> + if (ProcGlobal->custodianLatch)
> >> + SetLatch(ProcGlobal->custodianLatch);
> >> +}
> >
> > With this representation we can't really implement waiting for a task or
> > such. And it doesn't seem like a great API for the caller to just specify a
> > mix of flags.
>
> At the moment, the idea is that nothing should need to wait for a task
> because the custodian only handles things that are relatively non-critical.

Which is just plainly not true as the patchset stands...

I think we're going to have to block if some cleanup as part of a checkpoint
hasn't been completed by the next checkpoint - otherwise it'll just end up
being way too confusing and there's absolutely no backpressure anymore.

> If that changes, this could probably be expanded to look more like
> RequestCheckpoint().
>
> What would you suggest using instead of a mix of flags?

I suspect an array of tasks with requested and completed counters or such?
With a condition variable to wait on?
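
Roughly something like this (untested sketch, names made up), which would also
give callers a way to wait for a previously requested task to finish:

    typedef struct CustodianTask
    {
        uint64      requested;      /* bumped by RequestCustodian() */
        uint64      completed;      /* bumped once the task has finished */
    } CustodianTask;

    typedef struct CustodianShmemStruct
    {
        slock_t             cust_lck;
        CustodianTask       tasks[NUM_CUSTODIAN_TASKS];
        ConditionVariable   cust_cv;    /* broadcast when completed advances */
    } CustodianShmemStruct;

A waiter would remember the requested counter at request time and then
ConditionVariableSleep() until completed has caught up to it.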

> >> + /* let the custodian know what it can remove */
> >> + CustodianSetLogicalRewriteCutoff(cutoff);
> >
> > Setting this variable in a custodian datastructure and then fetching it from
> > there seems architecturally wrong to me.
>
> Where do you think it should go? I previously had it in the checkpointer's
> shared memory, but you didn't like that the functions were declared in
> bgwriter.h (along with the other checkpoint stuff). If the checkpointer
> shared memory is the right place, should we create checkpointer.h and use
> that instead?

Well, so far I have not understood what the whole point of the shared state
is, so I have a bit of a hard time answering this ;)

> >> +/*
> >> + * Remove all mappings not needed anymore based on the logical restart LSN saved
> >> + * by the checkpointer. We use this saved value instead of calling
> >> + * ReplicationSlotsComputeLogicalRestartLSN() so that we don't interfere with an
> >> + * ongoing call to CheckPointLogicalRewriteHeap() that is flushing mappings to
> >> + * disk.
> >> + */
> >
> > What interference could there be?
>
> My concern is that the custodian could obtain a later cutoff than what the
> checkpointer does, which might cause files to be concurrently unlinked and
> fsync'd. If we always use the checkpointer's cutoff, that shouldn't be a
> problem. This could probably be better explained in this comment.

How about having a Datum argument to RequestCustodian() that is forwarded to
the task?
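
Something like this (sketch only, the task constant is made up):

    /* checkpointer, after computing the cutoff */
    RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS, LSNGetDatum(cutoff));

    /* custodian, when running the task */
    RemoveOldLogicalRewriteMappings(DatumGetLSN(arg));

That way the custodian always works with exactly the cutoff the checkpointer
computed, without needing a separate shared variable for it.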

> >> +void
> >> +RemoveOldLogicalRewriteMappings(void)
> >> +{
> >> + XLogRecPtr cutoff;
> >> + DIR *mappings_dir;
> >> + struct dirent *mapping_de;
> >> + char path[MAXPGPATH + 20];
> >> + bool value_set = false;
> >> +
> >> + cutoff = CustodianGetLogicalRewriteCutoff(&value_set);
> >> + if (!value_set)
> >> + return;
> >
> > Afaics nothing clears values_set - is that a good idea?
>
> I'm using value_set to differentiate the case where InvalidXLogRecPtr means
> the checkpointer hasn't determined a value yet versus the case where it
> has. In the former, we don't want to take any action. In the latter, we
> want to unlink all the files. Since we're moving to a request model for
> the custodian, I might be able to remove this value_set stuff completely.
> If that's not possible, it probably deserves a better comment.

It would.

Greetings,

Andres Freund
