Re: O(n) tasks cause lengthy startups and checkpoints

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Maxim Orlov <orlovmg(at)gmail(dot)com>, Amul Sul <sulamul(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: O(n) tasks cause lengthy startups and checkpoints
Date: 2022-11-27 23:34:34
Message-ID: 20221127233434.GA878043@nathanxps13
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for taking a look!

On Thu, Nov 24, 2022 at 05:31:02PM +0000, Simon Riggs wrote:
> * not sure I believe that everything it does can always be aborted out
> of and shutdown - to achieve that you will need a
> CHECK_FOR_INTERRUPTS() calls in the loops in patches 5 and 6 at least

I did something like this earlier, but was advised to simply let the
functions finish as usual during shutdown [0]. I think this is what the
checkpointer process does today, anyway.

> * not sure why you want immediate execution of custodian tasks - I
> feel supporting two modes will be a lot harder. For me, I would run
> locally when !IsUnderPostmaster and also in an Assert build, so we can
> test it works right - i.e. running in its own process is just a
> production optimization for performance (which is the stated reason
> for having this)

I added this because 0004 involves requesting a task from the postmaster,
so checking for IsUnderPostmaster doesn't work. Those tasks would always
run immediately. However, we could use IsPostmasterEnvironment instead,
which would allow us to remove the "immediate" argument. I did it this way
in v14.

I'm not sure about running locally in Assert builds. It's true that would
help ensure there's test coverage for the task logic, but it would also
reduce coverage for the custodian logic. And in general, I'm worried about
having Assert builds use a different code path than production builds.

> 0005 seems good from what I know
> * There is no check to see if it worked in any sane time

What did you have in mind? Should the custodian begin emitting WARNINGs
after a while?

> * It seems possible that "Old" might change meaning - will that make
> it break/fail?

I don't believe so.

> Rather than explicitly use DEBUG1 everywhere I would have an
> #define CUSTODIAN_LOG_LEVEL LOG
> so we can run with it in LOG mode and then set it to DEBUG1 with a one
> line change in a later phase of Beta

I can create a separate patch for this, but I don't think I've ever seen
this sort of thing before. Is the idea just to help with debugging during
the development phase?

> I can't really comment with knowledge on sub-patches 0002 to 0004.
>
> Perhaps you should aim to get 1, 5, 6 committed first and then return
> to the others in a later CF/separate thread?

That seems like a good idea since those are all relatively self-contained.
I removed 0002-0004 in v14.

[0] https://postgr.es/m/20220217065938.x2esfdppzypegn5j%40alap3.anarazel.de

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v14-0001-Introduce-custodian.patch text/x-diff 24.3 KB
v14-0002-Move-removal-of-old-serialized-snapshots-to-cust.patch text/x-diff 4.5 KB
v14-0003-Move-removal-of-old-logical-rewrite-mapping-file.patch text/x-diff 9.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2022-11-27 23:45:28 Re: wake up logical workers after ALTER SUBSCRIPTION
Previous Message Andrey Borodin 2022-11-27 21:29:18 Re: Amcheck verification of GiST and GIN