From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Nitin Motiani <nitinmotiani(at)google(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles |
Date: | 2025-09-09 06:58:25 |
Message-ID: | CAA4eK1LANwLdEhavTfTtmOD8LJ8uUoMY7FtPX_3YF7ge=Z7TcA@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Sep 8, 2025 at 3:03 PM Nitin Motiani <nitinmotiani(at)google(dot)com> wrote:
>
> I'd like to propose a patch to allow accepting connections post recovery without waiting for the removal of old xlog files.
>
> Why: We have seen instances where crash recovery takes very long (tens of minutes to hours) when a large number of accumulated WAL files need to be cleaned up (e.g., cleaning up 2M old WAL files took close to 4 hours).
>
> This WAL accumulation is usually caused by :
>
> 1. Inactive replication slot
> 2. PITR failing to keep up
>
> In the above cases when the resolution (deleting inactive slot/disabling PITR) is followed by a crash (before checkpoint could run), we see the recovery take a very long time. Note that in these cases the actual WAL replay is done relatively quickly and most of the delay is due to RemoveOldXlogFiles().
>
Wouldn't it be better to fix the causes of WAL accumulation? Even
without a crash and recovery, the accumulated WAL can fill up the disk.
For example, one can set idle_replication_slot_timeout to invalidate
inactive slots. Similarly, we can investigate what makes PITR fall
behind and try to avoid that.
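
As a rough sketch of the slot-timeout suggestion, assuming PostgreSQL 18
or later (where idle_replication_slot_timeout is available) and using a
one-day timeout purely as an illustrative value, something like the
following could be tried:

```sql
-- Assumption: PostgreSQL 18+; the '1d' value is illustrative only,
-- not a recommendation. Zero (the default) disables the timeout.
ALTER SYSTEM SET idle_replication_slot_timeout = '1d';
SELECT pg_reload_conf();

-- List slots that are currently inactive and how long they have been so.
-- inactive_since and invalidation_reason are available from PostgreSQL 17.
SELECT slot_name, slot_type, inactive_since, wal_status, invalidation_reason
FROM pg_replication_slots
WHERE NOT active;
```

Slots idle for longer than the timeout are invalidated at checkpoint
time, after which the WAL they were holding back becomes eligible for
removal.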
--
With Regards,
Amit Kapila.