Fwd: restore_command on high-throughput cluster never switches to streaming replication

From:	Kasper Føns <kasper(dot)fons(at)cloudkitchens(dot)com>
To:	pgsql-general(at)lists(dot)postgresql(dot)org
Subject:	Fwd: restore_command on high-throughput cluster never switches to streaming replication
Date:	2025-12-01 09:49:42
Message-ID:	CANOng2i6xLa-FsN1B_rZFpW807GrV3YUJVgDM3nqJEj1gCk2dg@mail.gmail.com
Lists:	pgsql-admin pgsql-general

Hi PostgreSQL community.

I debugged an instance where a PostgreSQL standby would not switch to
streaming replication when the `restore_command` failed.
I first posted this to the pgsql-admin mailing list, but I am now trying
here as I got no response.

*Expectation*
I expect PostgreSQL to try switching to streaming replication if the
`restore_command` fails.

*What happens*
After a failed restore, PostgreSQL first re-requests the WAL segment it had
already restored and then retries the failed segment. Because the primary
produces WAL at a high rate, the previously missing WAL file has been
archived by then, so the retry succeeds and PostgreSQL does not try to
switch to streaming replication.

*Context*
Running PostgreSQL 15.7 in Kubernetes using CloudNative PostgreSQL Operator.

*Logs*
I configured PostgreSQL to emit DEBUG3 level logs. Newest logs first,
oldest last.

got WAL segment from archive
executing restore command "/controller/manager wal-restore
--log-destination /controller/log/postgres.json *000000410000A7BA00000058*
pg_wal/RECOVERYXLOG"
got WAL segment from archive
executing restore command "/controller/manager wal-restore
--log-destination /controller/log/postgres.json *000000410000A7BA00000057*
pg_wal/RECOVERYXLOG"
could not open file "pg_wal/*000000410000A7BA00000058*": No such file or
directory
could not restore file "*000000410000A7BA00000058*" from archive: child
process exited with exit code 1
executing restore command "/controller/manager wal-restore
--log-destination /controller/log/postgres.json *000000410000A7BA00000058*
pg_wal/RECOVERYXLOG"
got WAL segment from archive
executing restore command "/controller/manager wal-restore
--log-destination /controller/log/postgres.json *000000410000A7BA00000057*
pg_wal/RECOVERYXLOG"

Notice that when *000000410000A7BA00000058* failed, PostgreSQL asked for
*000000410000A7BA00000057*, which it had already restored. Afterwards, it
asks for *000000410000A7BA00000058* once again.
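
To check other logs for the same pattern, the following is a minimal Python sketch, not part of the original setup. It assumes the DEBUG3 output is available as plain text in newest-first order, as in the excerpt above, and that segment names may be wrapped in the `*` emphasis markers used in this mail. The names `EVENT_RE`, `chronological_events` and `repeated_requests` are made up for this illustration.

```python
import re

# One alternative per event type. A WAL segment name is 24 hex digits; the
# optional '*' covers the emphasis markers used in this mail.
EVENT_RE = re.compile(
    r'executing restore command\s+"[^"]*?\*?(?P<request>[0-9A-F]{24})\*?'
    r'|could not restore file\s+"\*?(?P<failed>[0-9A-F]{24})\*?"'
)

# Log excerpt from this mail, newest entry first.
SAMPLE = '''\
got WAL segment from archive
executing restore command "/controller/manager wal-restore
--log-destination /controller/log/postgres.json *000000410000A7BA00000058*
pg_wal/RECOVERYXLOG"
got WAL segment from archive
executing restore command "/controller/manager wal-restore
--log-destination /controller/log/postgres.json *000000410000A7BA00000057*
pg_wal/RECOVERYXLOG"
could not open file "pg_wal/*000000410000A7BA00000058*": No such file or
directory
could not restore file "*000000410000A7BA00000058*" from archive: child
process exited with exit code 1
executing restore command "/controller/manager wal-restore
--log-destination /controller/log/postgres.json *000000410000A7BA00000058*
pg_wal/RECOVERYXLOG"
got WAL segment from archive
executing restore command "/controller/manager wal-restore
--log-destination /controller/log/postgres.json *000000410000A7BA00000057*
pg_wal/RECOVERYXLOG"
'''


def chronological_events(log_text):
    """Return (kind, segment) pairs, oldest first, from a newest-first log."""
    events = []
    for match in EVENT_RE.finditer(log_text):
        kind = match.lastgroup  # "request" or "failed"
        events.append((kind, match.group(kind)))
    events.reverse()  # the log is newest-first, so flip it
    return events


def repeated_requests(events):
    """Return segments that were requested from the archive more than once."""
    seen, repeated = set(), []
    for kind, segment in events:
        if kind != "request":
            continue
        if segment in seen and segment not in repeated:
            repeated.append(segment)
        seen.add(segment)
    return repeated


for kind, segment in chronological_events(SAMPLE):
    print(kind, segment)
print("requested more than once:", repeated_requests(chronological_events(SAMPLE)))
```

Read oldest-first, the events make the loop visible: a failed request for a segment is followed by a second request for the segment before it, and then by another request for the failed one.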

*Problem*
This is problematic because the standby will never switch to streaming
replication.

*Workaround*
We can get the PostgreSQL replica to become in-sync if we change the
`restore_command` to `/bin/false` once we are within `wal_keep_size`.
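
As a rough illustration of what "within `wal_keep_size`" means here, the sketch below converts a WAL segment file name into the LSN at which that segment starts and compares the distance to the primary's current LSN against `wal_keep_size`. It assumes the default 16 MB WAL segment size and a `wal_keep_size` given in megabytes. The primary LSN in the usage lines is a made-up value (on a real system it could be read with `pg_current_wal_lsn()` on the primary), and all function names are hypothetical. It is only an approximation: whether the primary still holds a segment also depends on things such as replication slots and checkpoint timing.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB segments


def segment_start_lsn(filename, segment_size=WAL_SEGMENT_SIZE):
    """Byte position at which the named WAL segment starts."""
    if len(filename) != 24:
        raise ValueError("expected a 24-character WAL segment name")
    # Characters 0-7 are the timeline ID, which does not affect the position.
    log_id = int(filename[8:16], 16)
    seg_in_log = int(filename[16:24], 16)
    segments_per_log = 0x100000000 // segment_size
    return (log_id * segments_per_log + seg_in_log) * segment_size


def parse_lsn(text):
    """Convert an LSN such as 'A7BA/58000000' into a byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def format_lsn(position):
    """Convert a byte position back into the usual 'X/Y' LSN notation."""
    return f"{position >> 32:X}/{position & 0xFFFFFFFF:X}"


def within_wal_keep_size(primary_lsn, segment_name, wal_keep_size_mb):
    """True if the segment starts no more than wal_keep_size behind the primary."""
    lag_bytes = parse_lsn(primary_lsn) - segment_start_lsn(segment_name)
    return lag_bytes <= wal_keep_size_mb * 1024 * 1024


# Usage with the failing segment from the logs and a hypothetical primary LSN.
needed = "000000410000A7BA00000058"
print(format_lsn(segment_start_lsn(needed)))
print(within_wal_keep_size("A7BA/60000000", needed, wal_keep_size_mb=1024))
```

If the check is true, the primary is likely to still have the segment in `pg_wal`, which is the situation in which switching the `restore_command` to `/bin/false` lets streaming take over.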

*Question*
Is this the expected behaviour?

I expect the function `WaitForWALToBecomeAvailable` to switch to streaming
replication as soon as a single `restore_command` invocation fails. That is
what happens when `/bin/false` is used as the `restore_command` instead.

Any help would be greatly appreciated.
/Kasper Føns

In response to

restore_command on high-throughput cluster never switches to streaming replication at 2025-11-24 13:46:26 from Kasper Føns
