Re: O(n) tasks cause lengthy startups and checkpoints

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Maxim Orlov <orlovmg(at)gmail(dot)com>, Amul Sul <sulamul(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: O(n) tasks cause lengthy startups and checkpoints
Date: 2022-02-17 18:23:37
Message-ID: 20220217182337.GA3247866@nathanxps13
Lists: pgsql-hackers

On Wed, Feb 16, 2022 at 10:59:38PM -0800, Andres Freund wrote:
> On 2022-02-16 20:14:04 -0800, Nathan Bossart wrote:
>> >> - while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
>> >> + while (!ShutdownRequestPending &&
>> >> + (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
>> >
>> > Uh, huh? It strikes me as a supremely bad idea to have functions *silently*
>> > not do their jobs when ShutdownRequestPending is set, particularly without a
>> > huge fat comment.
>>
>> The idea was to avoid delaying shutdown because we're waiting for the
>> custodian to finish relatively nonessential tasks. Another option might be
>> to just exit immediately when the custodian receives a shutdown request.
>
> I think we should just not do either of these and let the functions
> finish. For the cases where shutdown really needs to be immediate
> there's, uhm, immediate mode shutdowns.

Alright.

>> > Why does this not open us up to new xid wraparound issues? Before there was a
>> > hard bound on how long these files could linger around. Now there's not
>> > anymore.
>>
>> Sorry, I'm probably missing something obvious, but I'm not sure how this
>> adds transaction ID wraparound risk. These files are tied to LSNs, and
>> AFAIK they won't impact slots' xmins.
>
> They're accessed by xid. The LSN is just for cleanup. Accessing files
> left over from a previous transaction with the same xid wouldn't be
> good - we'd read wrong catalog state for decoding...

Okay, that part makes sense to me. However, I'm still confused about how
this is handled today and why moving cleanup to a separate auxiliary
process makes matters worse. I've done quite a bit of reading, and I
haven't found anything that seems intended to prevent this problem. Do you
have any pointers?
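
To make the concern concrete, here is a minimal sketch of the hazard as I
understand it from the description above. It is not taken from the tree: the
file naming scheme and the helpers (xid_file_name, xid_after, names_collide)
are hypothetical, and the wraparound arithmetic is simplified (it ignores the
reserved special xids). It only shows that a name derived from a 32-bit xid
comes back around once enough transactions have been consumed, so a file that
was never cleaned up could be mistaken for a current one.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: a 32-bit transaction ID, standing in for TransactionId. */
typedef uint32_t Xid;

/* Hypothetical naming scheme: the file name depends only on the xid. */
static void
xid_file_name(Xid xid, char *buf, size_t len)
{
    /* "xid-" + 8 hex digits + ".dat" + NUL terminator = 17 bytes */
    assert(len >= 17);
    snprintf(buf, len, "xid-%08x.dat", (unsigned int) xid);
}

/* Simplified: the xid reached after consuming n more transactions. */
static Xid
xid_after(Xid start, uint64_t n)
{
    return (Xid) (start + n);   /* truncation models the 32-bit wrap */
}

/*
 * Returns 1 if a file left over from old_xid would be found by a lookup
 * for new_xid, 0 otherwise.
 */
static int
names_collide(Xid old_xid, Xid new_xid)
{
    char        old_name[32];
    char        new_name[32];

    xid_file_name(old_xid, old_name, sizeof(old_name));
    xid_file_name(new_xid, new_name, sizeof(new_name));
    return strcmp(old_name, new_name) == 0;
}
```

With this scheme, names_collide(1000, xid_after(1000, UINT64_C(1) << 32))
reports a collision, whereas neighboring xids do not collide. A hard bound on
how long such files can linger is what would rule out the first case.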

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
