Re: O(n) tasks cause lengthy startups and checkpoints

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Maxim Orlov <orlovmg(at)gmail(dot)com>, Amul Sul <sulamul(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: O(n) tasks cause lengthy startups and checkpoints
Date: 2022-12-02 19:15:07
Message-ID: 20221202191507.GA2277157@nathanxps13
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Dec 02, 2022 at 12:11:35PM +0530, Bharath Rupireddy wrote:
> On Fri, Dec 2, 2022 at 3:10 AM Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>> The test appears to reliably create snapshot and mapping files, so if the
>> directories are empty at some point after the checkpoint at the end, we can
>> be reasonably certain the custodian took action. I didn't add explicit
>> checks that there are files in the directories before the checkpoint
>> because a concurrent checkpoint could make such checks unreliable.
>
> I think you're right. I added sqls to see if the snapshot and mapping
> files count > 0, see [1] and the cirrus-ci members are happy too -
> https://github.com/BRupireddy/postgres/tree/custodian_review_2. I
> think we can consider adding these count > 0 checks to tests.

My worry about adding "count > 0" checks is that a concurrent checkpoint
could make them unreliable. In other words, those checks might ordinarily
work, but if an automatic checkpoint causes the files be cleaned up just
beforehand, they will fail.

> Having said above, I'm okay to process ShutdownRequestPending as early
> as possible, however, should we also add CHECK_FOR_INTERRUPTS()
> alongside ShutdownRequestPending?

I'm not seeing a need for CHECK_FOR_INTERRUPTS. Do you see one?

> While thinking about this, one thing that really struck me is what
> happens if we let the custodian exit, say after processing
> ShutdownRequestPending immediately or after a restart, leaving other
> queued tasks? The custodian will never get to work on those tasks
> unless the requestors (say checkpoint or some other process) requests
> it to do so after restart. Maybe, we don't need to worry about it.
> Maybe we need to worry about it. Maybe it's an overkill to save the
> custodian's task state to disk so that it can come up and do the
> leftover tasks upon restart.

Yes, tasks will need to be retried when the server starts again. The ones
in this patch set should be requested again during the next checkpoint.
Temporary file cleanup would always be requested during server start, so
that should be handled as well. Even today, the server might abruptly shut
down while executing these tasks, and we don't have any special handling
for that.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2022-12-02 19:21:01 Re: wake up logical workers after ALTER SUBSCRIPTION
Previous Message Corey Huinker 2022-12-02 19:06:09 Re: Error-safe user functions