Re: O(n) tasks cause lengthy startups and checkpoints

From: "Bossart, Nathan" <bossartn(at)amazon(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Maxim Orlov <orlovmg(at)gmail(dot)com>, Amul Sul <sulamul(at)gmail(dot)com>, "Andres Freund" <andres(at)anarazel(dot)de>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: O(n) tasks cause lengthy startups and checkpoints
Date: 2022-01-18 20:00:41
Message-ID: 83A7426F-A7B6-4A10-A8F0-179AE30B871D@amazon.com
Lists: pgsql-hackers

On 1/14/22, 11:26 PM, "Bharath Rupireddy" <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
> On Sat, Jan 15, 2022 at 12:46 AM Bossart, Nathan <bossartn(at)amazon(dot)com> wrote:
>> I'd personally like to avoid creating two code paths for the same
>> thing. Are there really cases when this one extra auxiliary process
>> would be too many? And if so, how would a user know when to adjust
>> this GUC? I understand the point that we should introduce new
>> processes sparingly to avoid burdening low-end systems, but I don't
>> think we should be afraid to add new ones when they are needed.
>
> IMO, having a GUC for enabling/disabling this new worker and its
> related code would be a better idea. The reason is that if
> postgres has no replication slots at all (which is quite possible in
> real stand-alone production environments) or if the file enumeration
> (directory traversal and file removal) is fast enough on the servers,
> there's no point having this new worker; the checkpointer itself can
> take care of the work as it is doing today.

IMO introducing a GUC wouldn't be doing users many favors. Their
cluster might work just fine for a long time before they begin
encountering problems during startups/checkpoints. Once the user
discovers the underlying reason, they then have to find a GUC that
enables a special background worker to make the problem go away.
Why not just fix the problem for everybody by default?

I've been thinking about what other approaches we could take besides
creating more processes. The root of the problem seems to be that
there are a number of tasks that are performed synchronously that can
take a long time. The process approach essentially makes these tasks
asynchronous so that they do not block startup and checkpointing. But
perhaps this can be done in an existing process, possibly even the
checkpointer. Like the current WAL pre-allocation patch, we could do
this work when the checkpointer isn't checkpointing, and we could also
do small amounts of work in CheckpointWriteDelay() (or a new function
called in a similar way). In theory, this would avoid delaying
checkpoints for too long while still doing cleanup at every
opportunity, which lowers the chances that the cleanup falls far behind.
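
To make the "small amounts of work" idea concrete, here is a rough,
standalone sketch of a resumable cleanup step with a per-call budget.
All names (CleanupState, cleanup_begin, cleanup_step) are made up for
illustration and are not existing PostgreSQL code; how such a step
would actually be wired into the checkpointer or CheckpointWriteDelay()
is left open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cleanup state; not a PostgreSQL structure. */
typedef struct CleanupState
{
    size_t next;     /* index of the next pending item to process */
    size_t total;    /* number of items found when the pass started */
    size_t removed;  /* items processed so far in this pass */
} CleanupState;

/* Start (or restart) a cleanup pass over `total` pending items. */
static void
cleanup_begin(CleanupState *state, size_t total)
{
    assert(state != NULL);
    state->next = 0;
    state->total = total;
    state->removed = 0;
}

/*
 * Process at most `budget` items and then return, so the caller can get
 * back to its main job. Returns true while work remains.
 */
static bool
cleanup_step(CleanupState *state, size_t budget)
{
    assert(state != NULL);

    while (budget > 0 && state->next < state->total)
    {
        /* A real implementation would remove one file here. */
        state->next++;
        state->removed++;
        budget--;
    }

    return state->next < state->total;
}
```

The caller would invoke cleanup_step() with a small budget whenever it
has spare time, and the state carries over between calls, so a large
backlog is worked off gradually instead of in one blocking pass.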

Nathan
