Re: [PATCH] Add archive_mode=follow_primary to prevent unarchived WAL on standby promotion

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, johnhyvr(at)gmail(dot)com
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Kirill Reshke <reshkekirill(at)gmail(dot)com>
Subject: Re: [PATCH] Add archive_mode=follow_primary to prevent unarchived WAL on standby promotion
Date: 2025-10-31 18:13:56
Message-ID: AF966153-0413-41FC-B5C2-5CB9A6F645A9@yandex-team.ru
Lists: pgsql-hackers

> On 24 Oct 2025, at 03:19, John H <johnhyvr(at)gmail(dot)com> wrote:
>
> Hi,
>
> On Thu, Oct 23, 2025 at 9:25 AM Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>>
>> Hi hackers,
>>
>> I'd like to propose a new archive_mode setting to address a gap in WAL
>> archiving for high availability streaming replication configurations.
>>
>> In HA setups using streaming replication, standbys can be
>> promoted when primary has failed. Some WAL segments might be not yet
>> archived. This creates gaps in the WAL archive, breaking point-in-time
>> recovery:
>>
>> 1. Primary generates WAL, streams to standby
>> 2. Standby receives WAL, marks segments as .done immediately
>
> +1 to the idea.
> If I understand correctly, the assumption we're making is that the standby
> doesn't really "archive", it just marks segments as .done, even though in
> theory it could do the same thing as the primary and avoid this issue. It
> would be wasted work if the primary and replica archive the same WAL, and
> that's what we want to avoid?

Yes, I'd like to avoid the cost of archiving the same file many times, and also the cost of asking the storage whether a given file is already archived.

>>
>> ## Implementation
>>
>> The patch adds two replication protocol messages:
>> - 'a' (PqReplMsg_ArchiveStatusQuery): standby → primary, sends (timeline, segno) pairs
>> - 'A' (PqReplMsg_ArchiveStatusResponse): primary → standby, responds with archived pairs
>>
>
> I might be missing something but isn't it enough for the writer to send
> the last_archived_wal in PgStat_ArchiverStats? That way we can avoid doing
> the full directory scan of archive_status.
> Or do we not feel comfortable assuming that WAL files are archived in order?

AFAIU the archiver archives in the order in which it reads the archive_status directory, i.e. in random order in the worst case.
Anyway, we could send .done signals to the standby, but we cannot be sure that the standby already has the WAL we are telling it not to archive... And the standby might have received these WALs from the archive already, in which case it does not need a .done file at all.

So, I implemented a basic design that works for the worst case. We can add some heuristics on top, but they must be negligibly cheap in any possible archiving scenario.
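
To illustrate the exchange, here is a minimal sketch in Python. It is not the patch code: the class and method names are made up for the example, and only the idea comes from the proposal (the standby sends (timeline, segno) pairs in an 'a' message, the primary answers in an 'A' message with the pairs it has archived). It assumes nothing about archiving order.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# (timeline, segno) pair, as carried by the 'a' and 'A' messages.
Segment = Tuple[int, int]


@dataclass
class Primary:
    # Hypothetical in-memory view of what the primary's archiver has finished.
    archived: Set[Segment] = field(default_factory=set)

    def handle_status_query(self, query: List[Segment]) -> List[Segment]:
        """Answer an 'a' message: return only the pairs already archived."""
        return [seg for seg in query if seg in self.archived]


@dataclass
class Standby:
    # Segments received via streaming but not yet confirmed by the primary.
    pending: Set[Segment] = field(default_factory=set)
    # Segments the primary has confirmed; safe to mark .done and recycle.
    done: Set[Segment] = field(default_factory=set)

    def build_status_query(self) -> List[Segment]:
        """Build the payload of an 'a' message from unconfirmed segments."""
        return sorted(self.pending)

    def apply_status_response(self, response: List[Segment]) -> None:
        """Apply an 'A' message: move confirmed segments to done."""
        for seg in response:
            if seg in self.pending:
                self.pending.discard(seg)
                self.done.add(seg)

    def segments_to_archive_after_promotion(self) -> List[Segment]:
        """After promotion, only unconfirmed segments still need archiving."""
        return sorted(self.pending)


# The primary archived segments 10 and 12 but not 11 (out of order).
primary = Primary(archived={(1, 10), (1, 12)})
standby = Standby(pending={(1, 10), (1, 11), (1, 12)})
response = primary.handle_status_query(standby.build_status_query())
standby.apply_status_response(response)
print(standby.segments_to_archive_after_promotion())  # prints [(1, 11)]
```

Because the answer is per segment, a gap such as segment 11 above is still caught, which a single "last archived WAL" value would miss if archiving is not strictly ordered.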

> On 27 Oct 2025, at 10:26, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>
> On Fri, Oct 24, 2025 at 1:25 AM Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>>
>> Hi hackers,
>>
>> I'd like to propose a new archive_mode setting to address a gap in WAL
>> archiving for high availability streaming replication configurations.
>>
>> ## Problem
>>
>> In HA setups using streaming replication, standbys can be
>> promoted when primary has failed. Some WAL segments might be not yet
>> archived. This creates gaps in the WAL archive, breaking point-in-time
>> recovery:
>>
>> 1. Primary generates WAL, streams to standby
>> 2. Standby receives WAL, marks segments as .done immediately
>> 3. Standby deletes WAL during checkpoints
>> 4. Primary hasn't archived yet (archiver lag, network issues, etc.)
>> 5. Primary vanishes
>> 6. Standby gets promoted
>> 7. WAL history lost from archive
>>
>> This is particularly problematic in synchronous replication where
>> promotion might happen while the primary is still catching up on archival.
>>
>> Promoted standby might have some WALs from walreceiver, some from archive. In
>> this case we need to archive only those WALs which were received, but not
>> confirmed to be archived by primary.
>>
>> ## Proposed Solution
>>
>> Add archive_mode=follow_primary, where standbys defer WAL deletion until
>> the primary confirms archival:
>
> Can't we achieve nearly the same behavior by setting archive_mode to
> always and configuring archive_command on the standby to check
> whether the WAL file already exists in the shared archive area
> (e.g., test -f <archive directory>/%f (probably also the WAL file size
> should be checked))? In this setup, archive_command would fail
> until the WAL file appears in the archive, preventing the standby
> from removing it while the command is failing.

Many storage services charge per request. If the archive tool issues a HEAD request to S3, it might cost the user some money.
Other storage services cap the request rate at some RPS. In the worst case we might affect the primary's ability to archive.

The key idea here is that the archive storage might be a disaster recovery system that is optimized for storing data, but not for listing that data frequently. So the cluster should not delegate the archive_status function to some distant storage if it can be tracked cheaply within the HA cluster itself.
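
A rough sketch of why the polling approach costs requests, in Python. Everything here is hypothetical: CountingArchive is a stand-in for remote storage, not a real S3 client, and the retry loop only imitates an archive_command that keeps failing until the segment shows up with the expected size.

```python
from typing import Dict, Optional

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default WAL segment size, 16 MB


class CountingArchive:
    """Hypothetical stand-in for remote storage that counts existence checks."""

    def __init__(self) -> None:
        self.objects: Dict[str, int] = {}
        self.head_requests = 0

    def put(self, name: str, size: int) -> None:
        self.objects[name] = size

    def head(self, name: str) -> Optional[int]:
        # Each call models one billable (or rate-limited) request.
        self.head_requests += 1
        return self.objects.get(name)


def archive_command_check(archive: CountingArchive, wal_name: str) -> bool:
    """Succeed only if the segment exists in the archive with full size."""
    return archive.head(wal_name) == WAL_SEGMENT_SIZE


def poll_until_archived(archive: CountingArchive, wal_name: str,
                        primary_archives_on_attempt: int) -> int:
    """Retry the check until it succeeds; return the number of attempts.

    The primary is simulated as finishing the upload right before the
    given attempt (an assumption made only to drive the example).
    """
    attempt = 0
    while True:
        attempt += 1
        if attempt == primary_archives_on_attempt:
            archive.put(wal_name, WAL_SEGMENT_SIZE)
        if archive_command_check(archive, wal_name):
            return attempt


archive = CountingArchive()
attempts = poll_until_archived(archive, "000000010000000000000010", 5)
print(attempts, archive.head_requests)  # prints 5 5
```

In this model every failed retry on every standby is one more request to the storage, for every segment, while the in-cluster status exchange needs none.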

Thanks for your interest!

Best regards, Andrey Borodin.
