[PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles

From: Nitin Motiani <nitinmotiani(at)google(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles
Date: 2025-09-08 09:33:00
Message-ID: CAH5HC95iCZMCP20zKnO=BEwmhJ7LxPL6-PMhW6+ph8oeReUzDw@mail.gmail.com
Lists: pgsql-hackers

Hi Hackers,

I'd like to propose a patch to allow accepting connections post recovery
without waiting for the removal of old xlog files.

*Why*: We have seen instances where *crash recovery takes very long (tens
of minutes to hours)* if a large number of accumulated WAL files need to be
cleaned up (e.g., cleaning up 2M old WAL files took close to 4 hours).

This WAL accumulation is usually caused by:

1. Inactive replication slot
2. PITR failing to keep up

In the above cases, when the resolution (deleting the inactive slot /
disabling PITR) is followed by a crash (before a checkpoint could run), we
see the recovery take a very long time. Note that in these cases the actual
WAL replay finishes relatively quickly and most of the delay comes from
RemoveOldXlogFiles().

*How*: This patch solves the issue by running RemoveOldXlogFiles()
separately and asynchronously. This is achieved by doing two things
(sketched in code after the list):

1. *Skip RemoveOldXlogFiles() for an END_OF_RECOVERY checkpoint*. This
ensures that recovery finishes sooner and postgres can start accepting
connections.
2. *After recovery, run another checkpoint without CHECKPOINT_WAIT*. This
is done in StartupXLOG(). It will lead to some extra work, but that should
be minuscule since it runs right after recovery, and the majority of the
work done by this checkpoint will be in RemoveOldXlogFiles(), which can now
run asynchronously.
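
To make this concrete, here is a rough sketch of the two changes
(paraphrased for this email rather than copied from the attached patch; the
surrounding code, the RemoveOldXlogFiles() arguments, and the exact flag
choice are simplified):

/* (1) In CreateCheckPoint(): skip the old-segment cleanup when this is
 * the end-of-recovery checkpoint, so recovery can finish sooner. */
if (!(flags & CHECKPOINT_END_OF_RECOVERY))
{
    /* existing cleanup path, arguments simplified */
    RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr, checkPoint.ThisTimeLineID);
}

/* (2) In StartupXLOG(), after recovery has finished: request a follow-up
 * checkpoint without CHECKPOINT_WAIT, so the deferred cleanup runs in the
 * checkpointer while connections are already being accepted. */
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE);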

I considered a couple of *alternative solutions* before attempting this.

1. One option could be to simply skip the removal of old xlog files during
recovery and let a later checkpoint take care of it. But with a large
checkpoint_timeout, the pg_wal bloat could persist for much longer.

2. Another approach might be to separate out RemoveOldXlogFiles() into a
new checkpoint request. This might be doable by creating a special
checkpoint flag like CHECKPOINT_ONLY_DELETE_OLD_FILES and using that in
RequestCheckpoint(), so that the second checkpoint only takes care of file
deletion (hypothetical sketch below). I ended up picking my approach over
this one because it can be done with a smaller change, which might make it
safer and less error-prone.
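
For illustration, a hypothetical sketch of that alternative (the flag name
and its value are made up; no such flag exists in the tree today):

/* Hypothetical flag -- not in PostgreSQL today; value illustrative only. */
#define CHECKPOINT_ONLY_DELETE_OLD_FILES 0x0200

/* In StartupXLOG(), after recovery: the checkpointer would skip the usual
 * buffer flushing and run only the RemoveOldXlogFiles() step. */
RequestCheckpoint(CHECKPOINT_ONLY_DELETE_OLD_FILES);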

I would like to know what folks think of these alternative approaches vs
the current one.

*Repro Steps*: To reproduce this, I inserted and deleted a few billion rows
while keeping an inactive replication slot on the publisher. I changed the
WAL segment size to 1MB to increase the number of files for a given amount
of data. With 1.5TB worth of WAL files, I could consistently reproduce a
40-minute delay; with 300GB it was around 10 minutes. With the proposed
patch, connections started being accepted right after redo was done.

These are the steps to reproduce this.

1. I created the publisher instance with the following settings:

wal-segsize=1MB (pg_ctl -D $PUB_DATA init -o --wal-segsize=1)

checkpoint_timeout=86400 to stop the periodic checkpoint from running (echo
"checkpoint_timeout = 86400" >> $PUB_DATA/postgresql.conf)

max_wal_size=102400 to avoid too many checkpoints during the transactions
(echo "max_wal_size = 102400" >> $PUB_DATA/postgresql.conf)

2. I created a database named test and then ran the following commands:

create table t_pub(id int);
alter table t_pub replica identity full;
create publication p;
alter publication p add table t_pub;

3. I created a subscriber instance (to create an inactive replication
slot). I didn't change any config settings for this instance. Here too I
created a database named test and ran the following commands:

create table t_pub(id int);
create subscription s connection 'application_name=test host=localhost
user=nitinmotiani dbname=test port=5001' publication p;
alter subscription s refresh publication;

4. I stopped the subscriber instance and verified on the first instance
(the publisher) that there was an inactive replication slot by running the
following command:

select slot_name, active from pg_replication_slots;

5. I inserted and deleted data by running the following pair of commands
multiple times:

insert into t_pub select generate_series(1, 2000000000);
delete from t_pub;

Depending on the number of times these are run, we can get different
amounts of data in the WAL directory.

6. I dropped the replication slot using the following:

select pg_drop_replication_slot('s');

7. I killed one of the postgres processes to trigger crash recovery.

By checking the logs, I confirmed that the patch reduces the crash recovery
time significantly. With the patch, while the removal of old xlog files was
still going on asynchronously, I also ran a few queries, created another
table, inserted data, etc., and it all worked.

I'm attaching the patch here. Currently I have not added any tests, as I
would first like to get feedback on this approach vs the alternatives.
Please let me know what you think.

Thanks & Regards,
Nitin Motiani
Google

Attachment Content-Type Size
v1-0001-Accept-connections-post-recovery-without-waiting-.patch application/octet-stream 5.7 KB
