Re: O(n) tasks cause lengthy startups and checkpoints

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Maxim Orlov <orlovmg(at)gmail(dot)com>, Amul Sul <sulamul(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: O(n) tasks cause lengthy startups and checkpoints
Date: 2022-12-06 07:28:28
Message-ID: CALj2ACW1F-T866iaqA_9h9fA+Axdyc86CHz7_k55RGH5HH2rtA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Dec 3, 2022 at 12:45 AM Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>
> On Fri, Dec 02, 2022 at 12:11:35PM +0530, Bharath Rupireddy wrote:
> > On Fri, Dec 2, 2022 at 3:10 AM Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
> >> The test appears to reliably create snapshot and mapping files, so if the
> >> directories are empty at some point after the checkpoint at the end, we can
> >> be reasonably certain the custodian took action. I didn't add explicit
> >> checks that there are files in the directories before the checkpoint
> >> because a concurrent checkpoint could make such checks unreliable.
> >
> > I think you're right. I added sqls to see if the snapshot and mapping
> > files count > 0, see [1] and the cirrus-ci members are happy too -
> > https://github.com/BRupireddy/postgres/tree/custodian_review_2. I
> > think we can consider adding these count > 0 checks to tests.
>
> My worry about adding "count > 0" checks is that a concurrent checkpoint
> could make them unreliable. In other words, those checks might ordinarily
> work, but if an automatic checkpoint causes the files be cleaned up just
> beforehand, they will fail.

Hm. It would have been better with a TAP test module for testing the
custodian code reliably. Anyway, that mustn't stop the patch getting
in. If required, we can park the TAP test module for later - IMO.
Others may have different thoughts here.

> > Having said above, I'm okay to process ShutdownRequestPending as early
> > as possible, however, should we also add CHECK_FOR_INTERRUPTS()
> > alongside ShutdownRequestPending?
>
> I'm not seeing a need for CHECK_FOR_INTERRUPTS. Do you see one?

Since the custodian has SignalHandlerForShutdownRequest as SIGINT and
SIGTERM handlers, unlike StatementCancelHandler and die respectively,
no need of CFI I guess. And also none of the CFI signal handler flags
applies to the custodian.

> > While thinking about this, one thing that really struck me is what
> > happens if we let the custodian exit, say after processing
> > ShutdownRequestPending immediately or after a restart, leaving other
> > queued tasks? The custodian will never get to work on those tasks
> > unless the requestors (say checkpoint or some other process) requests
> > it to do so after restart. Maybe, we don't need to worry about it.
> > Maybe we need to worry about it. Maybe it's an overkill to save the
> > custodian's task state to disk so that it can come up and do the
> > leftover tasks upon restart.
>
> Yes, tasks will need to be retried when the server starts again. The ones
> in this patch set should be requested again during the next checkpoint.
> Temporary file cleanup would always be requested during server start, so
> that should be handled as well. Even today, the server might abruptly shut
> down while executing these tasks, and we don't have any special handling
> for that.

Right.

The v18 patch set posted upthread
https://www.postgresql.org/message-id/20221201214026.GA1799688%40nathanxps13
looks good to me. I see the CF entry is marked RfC -
https://commitfest.postgresql.org/41/3448/.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2022-12-06 07:50:24 Re: Perform streaming logical transactions by background workers and parallel apply
Previous Message Kyotaro Horiguchi 2022-12-06 07:27:50 Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures